Space-Constrained Interval Selection 



Yuval Emek    Magnus M. Halldorsson    Adi Rosen

Abstract 

We study streaming algorithms for the interval selection problem: finding a maximum cardinality
subset of disjoint intervals on the line. A deterministic 2-approximation streaming
algorithm for this problem is developed, together with an algorithm for the special case of
proper intervals, achieving an improved approximation ratio of 3/2. We complement these upper
bounds by proving that they are essentially best possible in the streaming setting: it is shown
that an approximation ratio of $2 - \epsilon$ (or $3/2 - \epsilon$ for proper intervals) cannot be achieved unless
the space is linear in the input size. In passing, we also answer an open question of Adler and
Azar [1] regarding the space complexity of constant-competitive randomized preemptive online
algorithms for the same problem.



1 Introduction

In this paper we consider the interval selection problem, namely, finding a maximum cardinality
subset of disjoint intervals from a given collection of intervals on the real line. It is well known
that this problem has a simple optimal algorithm in the classical setting when the complete set of
intervals is given to the algorithm [13]. Here we study this problem in the streaming model [16, 22],
where the input is given to the algorithm as a stream of items (intervals in our case), one at a time,
and the algorithm has a limited memory that precludes storing the whole input. Yet, the algorithm
is still required to output a feasible solution, with a good approximation ratio.

The motivation for the streaming model stems from applications of managing very large data 
sets, such as biological data (DNA sequencing), network traffic data, and more. Although some 
function of the whole data set is to be computed, it is impossible to store the whole input. Depending 
on the setting, different variants of the streaming model have been considered in the literature, such 
as the classical streaming model [16] or the so-called semi-streaming model. Common to all
of them is the fact that the space used by the streaming algorithm is linear in some natural upper
bound on the size of the output it returns (sometimes, a multiplicative polylogarithmic overhead
is allowed).

In many problems considered in the streaming literature, the size of the output is fully determined
by some parameter of the input, and thus, one would typically express the space complexity
as a function of this parameter (cf. [H IS]). However, in other problems, the size of the output 
cannot be a priori expressed that way as it depends on the given instance; in such settings it is 
natural to seek a streaming algorithm whose space complexity is not much larger than the output 
size of the given instance (cf. [15]). Clearly, as long as the computational model of the streaming
algorithm is based on a Turing machine with no distinction between the working tape and the 
output tape, the size of the output is an inherent lower bound on the required space. 

In this paper, we consider a setting where the algorithm is given a stream of real-line intervals, 
each one defined by its two endpoints, and the goal is to compute a maximum cardinality subset 
of disjoint intervals (or an approximation thereof). This problem finds many applications, e.g., in 
resource allocation problems, and it has been extensively studied in the online and offline settings 
in many variants. We seek algorithms with a good upper bound on the space they use for a given 
instance, expressed in terms of the size of the output for that specific instance. Typically, we seek 
algorithms that use space which is at most linear in the size of the output and yet guarantee a good 
approximation ratio. 

Related Work. The offline interval selection problem corresponds to finding a maximum independent
set in an interval graph. An optimal greedy algorithm was discovered early [14] and has since
been a staple of algorithms textbooks [8, 17]. It should be noted that the input can be given in (at
least) two different ways: as an intersection graph with the nodes corresponding to the intervals,
or as a set of intervals given by their endpoints. This distinction makes little difference in the
traditional offline setting, where switching between these representations can be done efficiently;
however, it can be important in access- or resource-constrained settings. We choose to study the
interval selection problem assuming the latter representation (that is, the input is given as a set
of intervals), since we believe that it makes more sense in applications related to the online and
streaming settings (most previous works on online interval selection make the same assumption).
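
For reference, the offline greedy mentioned above can be sketched as follows. This is a minimal illustration, not code from the cited works: it assumes closed intervals given as (left, right) pairs, and the name `greedy_select` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # assumed representation: closed interval (left, right)


def greedy_select(intervals: List[Interval]) -> List[Interval]:
    """Scan intervals by increasing right endpoint and keep each one that is
    disjoint from the last interval kept."""
    chosen: List[Interval] = []
    last_right = float("-inf")
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        # Closed intervals that merely touch at an endpoint still intersect,
        # hence the strict inequality.
        if left > last_right:
            chosen.append((left, right))
            last_right = right
    return chosen
```

Note that the sketch reads the whole input before sorting, which is exactly what a space-constrained streaming algorithm cannot afford.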

The study of space-constrained algorithms goes back at least to the 1980 work of Munro and 
Paterson on selection and sorting [21]. More recently, the streaming model was developed to
capture the processing of massive data-sets that arise in practice [22]. Most streaming algorithms
deal with the approximate computation of various statistics, or "heavy hitters", as exemplified by
the celebrated paper of Alon, Matias, and Szegedy.

A number of classic graph theoretic problems have been treated in the streaming setting, for
example, matching problems [20], diameter and shortest paths [11, 12], min-cut [3], and graph
spanners [12]. These were mostly studied under the semi-streaming model, introduced by Feigenbaum
et al. [11]; in this model, the algorithm is allowed to use $n \log^{O(1)}(n)$ space on an $n$-vertex
graph (i.e., $\log^{O(1)}(n)$ bits per vertex). Closest to our problem, the independent set problem in
general sparse graphs (and hypergraphs) was studied in the streaming setting by Halldorsson et
al. [15]. Geometric streaming algorithms have also been appearing in recent years, especially dealing
with extent and ranges, such as [2].

There is a plethora of literature on interval selection in the online setting. Some papers capture 
the problem as a call admission problem on a linear network, with the objective of maximizing the 
number (or weight) of accepted calls. Awerbuch et al. [5] present a strongly $\lceil \log N \rceil$-competitive
algorithm for the problem, where $N$ is the number of nodes on the line (corresponding to the number
of possible interval endpoints). This yields an $O(\log \Delta)$-competitive algorithm for the weighted case,
where $\Delta$ is the ratio between the longest and the shortest interval. On the negative side, Awerbuch et
al. [5] establish a lower bound of $\Omega(\log N)$ on the competitive ratio of randomized non-preemptive
online interval selection algorithms. In the context of the real line, this immediately implies that
such algorithms cannot have a competitive ratio that is independent of the length of the input. In
fact, Bachmann et al. [6] recently showed that the competitive ratio of randomized non-preemptive
online algorithms for interval selection on the real line must be linear in the number of intervals in
the input. Preemptive online scheduling has a lower bound of $\Omega(\log \Delta / \log\log \Delta)$ in the weighted
case [7]. In comparison, much better results are possible for preemptive online algorithms in the
unweighted setting: Adler and Azar [1] devise a 16-competitive algorithm. One way of easing the
task of the algorithm is to assume arrival by time, i.e., the intervals arrive in order of left endpoints.
This has been treated for different weighted problems [23].

Our results. We give tight results for the interval selection problem in the streaming setting. Our 
main positive result is a deterministic 2-approximation streaming algorithm that uses space linear 
in the size of the output (Sec. [3]). This is complemented with a matching lower bound (Sec. [s]),
stating that an approximation ratio of $2 - \epsilon$ cannot be obtained by any randomized streaming
algorithm with space significantly smaller than the size of the input (which is much larger than the
size of the output). The special case of proper interval collections (i.e., collections of intervals with
no proper containments) is also considered, for which a deterministic 3/2-approximation streaming
algorithm that uses space linear in the output size is presented (Sec. [4]); a matching lower bound
on the approximation ratio is established (Sec. [s]) for streams of unit intervals (a special case of
proper intervals). The upper bounds are extended to multiple-pass streaming algorithms: we show
that an approximation ratio of $1 + 1/(2p - 1)$ can be obtained in $p$ passes over the input (Sec. [g]).
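
As a quick sanity check of the multiple-pass bound, the following worked evaluation (our own arithmetic, not taken from the paper) plugs small values of $p$ into the stated ratio; the case $p = 1$ recovers the single-pass ratio of 2.

```latex
\[
  1 + \frac{1}{2p-1} = \frac{2p}{2p-1}, \qquad
  p = 1 \mapsto 2, \quad
  p = 2 \mapsto \tfrac{4}{3}, \quad
  p = 3 \mapsto \tfrac{6}{5}.
\]
```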

In passing, we also answer an open question posed by Adler and Azar [1] in the context of 
randomized preemptive online algorithms for the interval selection problem. Adler and Azar point 
out that the decisions made by their online algorithm depend on the whole history (i.e., the input 
seen so far) and that natural attempts to remove this dependency seem to fail. Consequently, 
they write (using the term "active call" for an interval in the solution maintained by the online 
algorithm) that "it seems very interesting to find out whether there exist constant-competitive
algorithms where each decision depends only on the currently active calls and maybe on additional
bounded information". We answer this question in the affirmative by slightly modifying our main
algorithm to achieve a randomized preemptive online algorithm that admits constant competitive 
ratio and uses space linear in the size of the optimal solution, rather than the size of the input, as 
the algorithm of Adler and Azar does (Sec. Uurj 



2 Preliminaries 



We think of the real line $\mathbb{R}$ as stretching from left to right so that an interval $I$ contains all points
between its left endpoint $\mathrm{left}(I)$ and its right endpoint $\mathrm{right}(I)$, where $\mathrm{left}(I) < \mathrm{right}(I)$. Each
endpoint can be either open (exclusive) or closed (inclusive). A half-open interval has a closed left
endpoint and an open right endpoint. (This is, perhaps, the natural interval type to use in most
resource allocation applications.) Observe that the assumption that $\mathrm{left}(I) < \mathrm{right}(I)$ implies that
every interval contains an open set (in the topological sense) and that half-open intervals are always
well defined.

The interval related notions of intersection, disjointness, and containment follow the standard 
view of an interval as a set of points. Two intervals $I, J$ properly intersect if they intersect without
containment; $I$ properly contains $J$ if $I$ contains $J$ and $J$ does not contain $I$. An interval collection
$\mathcal{I}$ is said to be proper (and the intervals in the collection, proper intervals) if no two intervals in $\mathcal{I}$
exhibit proper containment. The load of $\mathcal{I}$ is defined to be $\max_{p \in \mathbb{R}} |\{I \in \mathcal{I} \mid p \in I\}|$.
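
The two notions just defined can be made concrete with a short sketch. It assumes closed intervals represented as (left, right) pairs; the function names `load` and `is_proper` are illustrative only.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # assumed representation: closed interval (left, right)


def load(intervals: List[Interval]) -> int:
    """Maximum number of intervals that share a common point (sweep line)."""
    events = []
    for left, right in intervals:
        events.append((left, 0))   # 0 = opening; sorts before a closing at the same point
        events.append((right, 1))  # 1 = closing
    events.sort()
    best = current = 0
    for _, kind in events:
        if kind == 0:
            current += 1
            best = max(best, current)
        else:
            current -= 1
    return best


def is_proper(intervals: List[Interval]) -> bool:
    """True if no interval properly contains another one."""
    for i, (l1, r1) in enumerate(intervals):
        for j, (l2, r2) in enumerate(intervals):
            if i != j and l1 <= l2 and r2 <= r1 and (l1, r1) != (l2, r2):
                return False
    return True
```

Processing openings before closings at equal coordinates reflects the closed-interval assumption: two intervals that touch at a single point both contain that point.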

The interval selection problem asks for a maximum cardinality subset of pairwise disjoint inter- 
vals out of a given set S of intervals. In the streaming model, the input interval set S is considered 
to be an ordered set (a.k.a. a stream) and the intervals arrive one by one according to that order. 
The intervals are specified by their endpoints, where each endpoint is represented by a bit string of 
length b (the same b for all endpoints). This may potentially provide a streaming algorithm with 
the edge of knowing in advance some bounds on the number of intervals that will arrive and on the 
number of intervals that can be placed between two existing intervals. However, our algorithms 
do not take advantage of this extra information and our lower bounds show that it is essentially 
useless. An optimal solution to a given instance $S$ of the interval selection problem is denoted by
$\mathrm{Opt}(S)$.

We may sometimes talk about segments, rather than intervals, when we want to emphasize that 
the entities under consideration are not part of the input. Given a set $\mathcal{I}$ of intervals, a component
(or connected component) of $\mathcal{I}$ is a maximal continuous segment in $\bigcup_{I \in \mathcal{I}} I$.
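
To illustrate the notion of a component, here is a minimal sketch that merges a set of intervals into its maximal continuous segments. It assumes closed intervals as (left, right) pairs; the name `components` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # assumed representation: closed interval (left, right)


def components(intervals: List[Interval]) -> List[Interval]:
    """Return the maximal continuous segments of the union, left to right."""
    segments: List[List[float]] = []
    for left, right in sorted(intervals):
        if segments and left <= segments[-1][1]:
            # Overlaps (or touches) the current segment: extend it.
            segments[-1][1] = max(segments[-1][1], right)
        else:
            segments.append([left, right])
    return [(s[0], s[1]) for s in segments]
```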

^ The technique employed in Sec. [t] is based on a "classify and randomly select" argument that guarantees that 
the solution produced by the online algorithm is a constant approximation of the optimal solution with constant 
probability. Using the technique of [18] (reformulated as Theorem 4.1 in [1]), this can be strengthened to guarantee 
a constant approximation with high probability. 
3 The Main Algorithm



Overview. Given a stream $S$ of intervals, our algorithm maintains a collection $A \subseteq S$, referred
to as the actual intervals, from which the output $\mathrm{Alg}(S) = \mathrm{Opt}(A)$ is taken. It also maintains a
collection $V$ of virtual intervals, where each virtual interval is the intersection of two actual intervals
that existed in $A$ at some point. The role of the virtual intervals is to filter out undesired intervals
from joining $A$: an arriving interval $I \in S$ joins $A$ if and only if it does not contain any currently
maintained virtual or actual interval.

Our algorithm is designed to guarantee that each interval $I \in S$ leaves a trace in either $A$ or
$V$, namely, there exists some $J \in A \cup V$ such that $J \subseteq I$. Moreover, if $I, I' \in A$ properly intersect,
then $I \cap I' \in V$. This essentially means that an arriving interval is rejected if and only if it contains
some previous interval of $S$ or the intersection of two properly intersecting previous intervals in $S$
that have belonged to $A$.

Following that, it is not too difficult to show that the load of the interval collection $A$ is at
most 2. Based on a careful analysis of the structure of the (connected) components in $A$ and the
locations of the virtual intervals within these components and between them, we can argue that
$|V| \le |A|$. This immediately yields the desired upper bound on the space of our algorithm as
$|A| \le 2 \cdot |\mathrm{Opt}(A)|$. The bound on the approximation ratio essentially stems from the observation
that $|\mathrm{Opt}(S)| \le |\mathrm{Opt}(A \cup V)|$ (a direct corollary of the fact that each interval in $S$ leaves a trace in
$A \cup V$) and from the invariant that each actual interval contains at most 2 virtual intervals.

It is interesting to point out that our algorithm is in fact a deterministic preemptive online 
algorithm that maintains a load-2 interval collection (the collection A). Since the main result of 
Adler and Azar [1] also relies on such an algorithm, one may wonder if the two algorithms can be
compared. Actually, the algorithm of Adler and Azar bases its rejection (and preemption) decisions 
on similar conditions: an arriving interval is rejected if and only if it contains some previous interval 
of S or the intersection of two properly intersecting intervals in A. (Adler and Azar use a different 
terminology, but the essence is very similar.) The difference lies in the latter condition: Whereas 
the algorithm of Adler and Azar considers only the properly intersecting intervals that are currently 
in A, our algorithm also (implicitly) considers properly intersecting intervals that belonged to A 
in the past and were preempted since. This seemingly small difference turns out to be crucial as
it enables our algorithm to use much less memory, thus giving rise to an interesting phenomenon:
by remembering extra information (i.e., intersecting intervals that belonged to A in the past and 
are not in A anymore), we actually end up using less memory. 

The algorithm. Consider a stream $S = (I_1, \dots, I_n)$ of intervals on the real line. It will be convenient
to assume that all endpoints are distinct, i.e., $\{\mathrm{left}(I), \mathrm{right}(I)\} \cap \{\mathrm{left}(J), \mathrm{right}(J)\} = \emptyset$
for every two intervals $I, J \in S$. Unless stated otherwise, we will also assume that the intervals
mentioned in this section are closed on both endpoints. These two assumptions are lifted in
Appendix A.

Our algorithm, denoted Alg, maintains a collection $A \subseteq S$ of actual intervals and a collection
$V$ of virtual intervals, where each virtual interval is realized by endpoints of intervals in $S$. That is,
the virtual interval $I \in V$ satisfies $\{\mathrm{left}(I), \mathrm{right}(I)\} \subseteq \{\mathrm{left}(J), \mathrm{right}(J) \mid J \in S\}$. The algorithm
initially sets $A, V \leftarrow \emptyset$. Then, upon arrival of a new interval $I \in S$, Alg proceeds according to the
policy presented in Algorithm 1.

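Algorithm 1 The policy of Alg upon arrival of an interval $I \in S$
 1: if $\exists J \in A \cup V$ s.t. $J \subseteq I$ then
 2:     reject $I$ and halt
 3: $A \leftarrow A \cup \{I\}$
 4: for all $J \in A$ s.t. $J \supset I$ do
 5:     $A \leftarrow A - \{J\}$
 6: for all $J \in V$ s.t. $J \supset I$ do
 7:     $V \leftarrow V - \{J\}$
 8: for $p \in \{\mathrm{left}(I), \mathrm{right}(I)\}$ do
 9:     if $\exists J \in V$ s.t. $p \in J$ then
10:         $V \leftarrow V - \{J\} \cup \{I \cap J\}$
11:     else if $\exists J \in A$ s.t. $p \in J$ then
12:         $V \leftarrow V \cup \{I \cap J\}$
13: for all $J \in A$ and $K \in V$ do
14:     if $\mathrm{left}(J) < \mathrm{left}(K) < \mathrm{right}(K) < \mathrm{right}(J)$ then
15:         $A \leftarrow A - \{J\}$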


The algorithm first verifies that the new interval $I$ does not contain any currently stored (actual
or virtual) interval; if it does, then the new interval is ignored (rejected). Therefore, if Alg reaches
line 3, then we can assume that $I \not\supseteq J$ for any interval $J \in A \cup V$. Next, in lines 4-7, Alg removes all
the actual and virtual intervals that contain $I$. Lines 8-12 form the heart of the algorithm: updating
the virtual intervals that remain in $V$. The idea here is that a virtual interval that intersects with $I$
is "trimmed" until it is contained in $I$; if an actual interval intersects with $I$, then the intersection
is introduced as a new virtual interval. Finally, any actual interval $J$ that exclusively contains some
virtual interval $K$ (that is, $J$ contains $K$ even if we remove $J$'s endpoints) is removed from the
actual interval collection $A$ in lines 13-15.
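
The policy described above can be rendered as a short executable sketch. This is an illustration under stated assumptions rather than the authors' implementation: intervals are closed (left, right) pairs with pairwise distinct endpoints, the class and method names are hypothetical, the interval J in line 11 is assumed to range over actual intervals other than the newly added one, and collections are plain lists with no attention paid to efficient lookup.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed representation: closed interval (left, right)


def contains(outer: Interval, inner: Interval) -> bool:
    """True if inner is a subset of outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersection(a: Interval, b: Interval) -> Interval:
    """Intersection of two intervals that are known to overlap."""
    return (max(a[0], b[0]), min(a[1], b[1]))


class StreamingSelector:
    """Hypothetical helper mirroring the policy of Algorithm 1."""

    def __init__(self) -> None:
        self.actual: List[Interval] = []   # the collection A
        self.virtual: List[Interval] = []  # the collection V

    def process(self, new: Interval) -> None:
        # Lines 1-2: reject if `new` contains a stored actual or virtual interval.
        if any(contains(new, j) for j in self.actual + self.virtual):
            return
        # Lines 3-7: add `new` and drop stored intervals that contain it.
        self.actual = [j for j in self.actual if not contains(j, new)] + [new]
        self.virtual = [k for k in self.virtual if not contains(k, new)]
        # Lines 8-12: trim or create a virtual interval at each endpoint of `new`.
        for p in new:
            hit = next((k for k in self.virtual if k[0] <= p <= k[1]), None)
            if hit is not None:
                self.virtual.remove(hit)
                self.virtual.append(intersection(new, hit))
                continue
            # Assumption: line 11 considers actual intervals other than `new`.
            other = next(
                (j for j in self.actual if j != new and j[0] <= p <= j[1]), None
            )
            if other is not None:
                self.virtual.append(intersection(new, other))
        # Lines 13-15: drop actual intervals that strictly contain a virtual one.
        self.actual = [
            j for j in self.actual
            if not any(j[0] < k[0] and k[1] < j[1] for k in self.virtual)
        ]

    def output(self) -> List[Interval]:
        """Opt(A), computed by the earliest-right-endpoint greedy."""
        chosen: List[Interval] = []
        for left, right in sorted(self.actual, key=lambda iv: iv[1]):
            if not chosen or left > chosen[-1][1]:
                chosen.append((left, right))
        return chosen
```

For example, feeding the stream (0, 10), (5, 15), (7, 20) first creates the virtual interval (5, 10), then trims it to (7, 10) when (7, 20) arrives, at which point (5, 15) strictly contains the trimmed interval and is preempted.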

After the last interval $I_n$ is processed, Alg outputs $\mathrm{Alg}(S) = \mathrm{Opt}(A)$, that is, an optimal subset
in the interval collection $A$ (computed, say, by the greedy left-to-right algorithm). In the remainder
of this section we prove that: (a) at all times, $|V| \le |A| \le 2 \cdot |\mathrm{Alg}(S)|$; and (b) $|\mathrm{Alg}(S)| \ge |\mathrm{Opt}(S)|/2$.
Together, we obtain the desired approximation with space at most a constant times larger than the
size of the output.

^ Note that Alg can be thought of as an online algorithm with preemption with respect to the set $A$.
Analysis. Throughout the analysis, we let $1 \le t \le n$ denote the time at which Alg completed
processing interval $I_t \in S$; time $t = 0$ denotes the beginning of the execution. We refer to the period
between time $t - 1$ and time $t$ as round $t$. The stream prefix $(I_1, \dots, I_t)$ is denoted by $S_t$. The
collections $A$ and $V$ at time $t$ are denoted by $A_t$ and $V_t$, respectively, although, when $t$ is clear from
the context, we may omit the subscript. We begin by showing that each virtual interval is indeed
realized by (at most) two actual intervals and that the new interval $I$ is not removed immediately
after joining $A$.

Proposition 3.1. At any time $t$, we have $\{\mathrm{left}(\rho), \mathrm{right}(\rho) \mid \rho \in V_t\} \subseteq \{\mathrm{left}(\sigma), \mathrm{right}(\sigma) \mid \sigma \in S_t\}$.

Proof. By induction on $t$. The case $t = 0$ is trivial as $V_0 = \emptyset$. For time $t > 0$, we observe that
any new virtual interval $\rho$ added to $V$ in round $t$ is either the intersection of two actual intervals
(line 12) or the intersection of an actual interval and a virtual interval in $V_{t-1}$ (line 10). In the
former case, the assertion follows immediately; in the latter case, the assertion follows by the
inductive hypothesis. □

Proposition 3.2. For every $1 \le t \le n$, if Alg reaches line 3 when processing $I_t = I$, then $I \in A_t$.

Proof. In line 3, $I$ is added to $A$ and subsequently, it can only be removed from $A$ if a virtual interval
$\rho$ that is contained in $I$ but does not have an endpoint in common with $I$ is found (line 15). Such
an interval $\rho$ cannot be in $V_{t-1}$ since otherwise, $I$ would have been rejected in line 2. The assertion
follows since every virtual interval added to $V$ in round $t$ has a common endpoint with $I$. □



Lemma 3.3 lies at the core of our analysis: it states that each interval in S leaves some trace 
in either $A$ or $V$. This will be employed later on to argue that $\mathrm{Alg}(S)$ is not much smaller than
$\mathrm{Opt}(S)$.

Lemma 3.3. For every interval $I_t \in S$ and for every time $t' \ge t$, there exists some interval
$\rho \in A_{t'} \cup V_{t'}$ such that $\rho \subseteq I_t$.



Proof. A newly arriving interval $I$ is added to $A$ in line 3 unless some interval $\rho \subseteq I$ is found in $A \cup V$.
An actual interval $\rho \in A$ is removed from $A$ only if another actual interval $I \subset \rho$ has just joined $A$
(line 5) or if a virtual interval $\sigma \subset \rho$ is found in $V$ (line 15). A virtual interval $\rho \in V$ is removed
from $V$ only if an actual interval $I \subset \rho$ has just joined $A$ (line 7) or if it is replaced in $V$ by another
virtual interval $\sigma \subset \rho$ (line 10). The assertion follows. □



The structural lemma. We now turn to establish our main lemma regarding the updating phase
in lines 8-12 and the resulting structure of the interval collections $A$ and $V$. Lemma 3.4 states
seven invariants maintained by our algorithm; these invariants are then proved simultaneously by
induction on $t$, essentially by straightforward analysis of the policy presented in Algorithm 1.

Lemma 3.4. For any round $1 \le t \le n$, the updating phase satisfies the following two properties:

(P1) If $\rho$ is added to $V$ in round $t$, then $\rho \in V_t$.

(P2) If $\rho$ and $\sigma$ are added to $V$ in round $t$, then $\rho \cap \sigma = \emptyset$.

Moreover, for any time $0 \le t \le n$, the interval collections $A$ and $V$ satisfy the following five
properties:

(P3) For every $\rho \in A$ and $\sigma \in V$, if $\rho \cap \sigma \ne \emptyset$, then $\sigma \subset \rho$ with a common endpoint.

(P4) For every $\rho, \sigma \in A$, if $\rho \cap \sigma \ne \emptyset$, then $\rho \cap \sigma \in V$.

(P5) Every point $p \in \mathbb{R}$ is contained in at most 1 virtual interval.

(P6) Every point $p \in \mathbb{R}$ is contained in at most 2 actual intervals.

(P7) There do not exist two actual intervals $\rho, \sigma \in A$ such that $\rho \subseteq \sigma$.
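
The five state properties (P3)-(P7) can be checked mechanically on any snapshot of the two collections. The sketch below is only an illustration: it assumes closed intervals as (left, right) pairs, treats (P4) and (P7) as statements about pairs of distinct actual intervals, and uses the hypothetical name `violated_properties`.

```python
from itertools import combinations
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # assumed representation: closed interval (left, right)


def intersects(a: Interval, b: Interval) -> bool:
    return max(a[0], b[0]) <= min(a[1], b[1])


def contains(outer: Interval, inner: Interval) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def violated_properties(A: List[Interval], V: List[Interval]) -> Set[str]:
    """Return the labels among P3-P7 that fail for the snapshot (A, V)."""
    bad: Set[str] = set()
    # (P3): an intersecting virtual interval is strictly inside the actual
    # interval and shares one of its endpoints.
    for rho in A:
        for sigma in V:
            if intersects(rho, sigma):
                shares = sigma[0] == rho[0] or sigma[1] == rho[1]
                if not (contains(rho, sigma) and sigma != rho and shares):
                    bad.add("P3")
    for rho, sigma in combinations(A, 2):
        # (P4): the intersection of two actual intervals is a virtual interval.
        if intersects(rho, sigma):
            if (max(rho[0], sigma[0]), min(rho[1], sigma[1])) not in V:
                bad.add("P4")
        # (P7): no containment between actual intervals.
        if contains(rho, sigma) or contains(sigma, rho):
            bad.add("P7")
    # (P5): virtual intervals are pairwise disjoint.
    for a, b in combinations(V, 2):
        if intersects(a, b):
            bad.add("P5")
    # (P6): no point lies in three actual intervals.
    for a, b, c in combinations(A, 3):
        if max(a[0], b[0], c[0]) <= min(a[1], b[1], c[1]):
            bad.add("P6")
    return bad
```

The (P6) test relies on the fact that three closed intervals on the line share a point exactly when the largest left endpoint does not exceed the smallest right endpoint.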



Proof. We first establish (P1) regardless of the other six properties.

Establishing (P1). It is sufficient to show that if $\rho$ is added to $V$ in lines 10 or 12 of the execution
for $p = \mathrm{left}(I)$, then it is not removed from $V$ in line 10 of the execution for $p = \mathrm{right}(I)$. Indeed,
if $\rho$ is added to $V$ in the execution for $p = \mathrm{left}(I)$, then $\rho = I \cap \sigma$ for some interval $\sigma \in A_{t-1} \cup V_{t-1}$
such that $\mathrm{left}(I) \in \sigma$. Since $\sigma$ cannot contain $I$ (as otherwise, it would have been removed in lines
5 or 7), it follows that $\mathrm{left}(\sigma) < \mathrm{left}(I) < \mathrm{right}(\sigma) < \mathrm{right}(I)$, so $\rho = [\mathrm{left}(I), \mathrm{right}(\sigma)]$. Therefore,
$\mathrm{right}(I) \notin \rho$ and $\rho$ is not removed from $V$ in line 10 of the execution for $p = \mathrm{right}(I)$.



Next, we establish (P2), (P3), (P4), and (P5) simultaneously by induction on $t$. The case $t = 0$
is trivial: (P2) holds vacuously, while (P3), (P4), and (P5) hold as $A_0 = V_0 = \emptyset$. Assume that the
four properties hold for $t - 1$ and consider the execution of Alg upon arrival of interval $I = I_t$ for
some $1 \le t \le n$.



Establishing (P2). As each iteration of the for loop in lines 8-12 adds at most one virtual interval
to $V$, we may assume that $\rho$ is added in the execution for $p = \mathrm{left}(I)$ and $\sigma$ is added in the execution
for $p = \mathrm{right}(I)$. This means that $\rho = I \cap \tau_\ell$ and $\sigma = I \cap \tau_r$ for some intervals $\tau_\ell, \tau_r \in A_{t-1} \cup V_{t-1}$
such that $\mathrm{left}(I) \in \tau_\ell$ and $\mathrm{right}(I) \in \tau_r$. We argue that $\tau_\ell$ and $\tau_r$ do not intersect, which implies
that $\rho$ and $\sigma$ do not intersect.

To that end, assume by way of contradiction that they do and let $\tau_\cap = \tau_\ell \cap \tau_r$. If both $\tau_\ell$ and
$\tau_r$ are virtual intervals, then we immediately reach a contradiction due to the inductive hypothesis on
(P5). If both $\tau_\ell$ and $\tau_r$ are actual intervals, which means that $\rho$ and $\sigma$ are added to $V$ in line 12,
then by the inductive hypothesis on (P4), $\tau_\cap \in V_{t-1}$. By definition, $\tau_\cap$ must intersect with $I$. On
the other hand, neither $\mathrm{left}(I)$ nor $\mathrm{right}(I)$ can belong to $\tau_\cap$ as otherwise, the else condition in
line 11 would not have passed, thus $\tau_\cap \subset I$. But this means that Alg should not have reached line 3
and in particular, $\rho$ and $\sigma$ would not have been added to $V$.

So, assume that $\tau_\ell$ is actual and $\tau_r$ is virtual (the proof of the converse possibility is identical).
By the inductive hypothesis on (P3), we know that $\tau_r \subset \tau_\ell$. But this implies that both endpoints
of $I$ belong to $\tau_\ell$, namely, $I \subset \tau_\ell$, and $\tau_\ell$ should have been removed from $A$ in line 5.
Establishing (P3). Consider some $\rho \in A_t$ and $\sigma \in V_t$ such that $\rho \cap \sigma \ne \emptyset$. If $\rho \in A_{t-1}$ and
$\sigma \in V_{t-1}$, then the property holds by the inductive hypothesis. Assume first that $\rho$ is added to $A$
in round $t$, so $\rho$ is the last arriving interval $I$. Notice that $\sigma$ cannot be in $V_{t-1}$ as this implies that
either (i) $\sigma \subseteq I$, in which case $I$ would have been rejected in line 2; (ii) $\sigma \supset I$, in which case $\sigma$
would have been removed from $V$ in line 7; or (iii) $\sigma$ and $I$ properly intersect, in which case $\sigma$ is
removed from $V$ in line 10. Thus, $\sigma$ is added to $V$ in round $t$ either in line 10 or in line 12. In both
cases, $\sigma$ is contained in $I$ with a common endpoint.

It remains to consider the case in which $\rho \in A_{t-1}$ and $\sigma$ is added to $V$ in round $t$. If $\sigma$ is added
to $V$ in line 10, then it replaces in $V$ some interval $\tau \in V_{t-1}$ such that $\sigma \subseteq \tau$. Hence, $\tau$ must also
intersect with $\rho$ and by the inductive hypothesis, $\tau \subset \rho$, so $\sigma$ must be contained in $\rho$. Since $\rho$ is
not removed in line 15, $\rho$ and $\sigma$ must have a common endpoint. If $\sigma$ is added to $V$ in line 12, then
$\sigma = I \cap \tau$ for some interval $\tau \in A_{t-1}$ such that the endpoint $p$ of $I$ is contained in $\tau$. The property
is established by arguing that $\tau$ and $\rho$ must be the same interval.

To that end, suppose toward a contradiction that $\tau \ne \rho$. Assume without loss of generality that
$p = \mathrm{left}(I)$, so $\mathrm{left}(\tau) < \mathrm{left}(I) < \mathrm{right}(\tau) < \mathrm{right}(I)$. Since $\sigma = I \cap \tau = [\mathrm{left}(I), \mathrm{right}(\tau)]$ intersects
with $\rho$, both $I$ and $\tau$ must also intersect with $\rho$. By the inductive hypothesis on (P4), we know that
$\sigma_\cap = \rho \cap \tau \in V_{t-1}$. We also know that $\sigma_\cap$ intersects with $I$ as both $\rho$ and $\tau$ intersect with $I$. Since
$\tau \not\supseteq I$, it follows that $\sigma_\cap \not\supseteq I$, hence $\sigma_\cap$ must still be in $V$ when Alg reaches line 8. If $\mathrm{left}(I) \in \sigma_\cap$,
then the else condition in line 11 would not have passed and $\sigma$ would not have been added to $V$ in
line 12, so $\mathrm{left}(I) \notin \sigma_\cap$. But $\mathrm{right}(I) \notin \sigma_\cap$ as $\mathrm{right}(I) \notin \tau$, hence $\sigma_\cap \subset I$ and $I$ should have been
rejected in line 2. In any case, we conclude that $\rho$ and $\tau$ are indeed the same interval.

Establishing (P4). Consider two intersecting intervals $\rho, \sigma \in A_t$. If both $\rho$ and $\sigma$ are also in
$A_{t-1}$, then by the inductive hypothesis, $\tau = \rho \cap \sigma \in V_{t-1}$. If $\tau \notin V_t$, then it must have been removed
from $V$ either in line 7 because $I \subset \tau$, in which case $I$ is also contained in both $\rho$ and $\sigma$ and they
would have been removed from $A$ in line 5; or in line 10, where it is replaced in $V$ by some other
virtual interval $\tau' \subset \tau$ (the strict containment follows from the distinct endpoints assumption), in
which case at least one of the intervals $\rho$ and $\sigma$ should have been removed in line 15. Therefore,
$\tau \in V_t$ and the property holds in that case.

So, suppose that $\rho \in A_{t-1}$, while $\sigma = I$ is added to $A$ in round $t$. Since $\rho, I \in A_t$, both $\rho$ and
$I$ are in $A$ when Alg reaches line 8, thus they cannot contain each other. Assume without loss
of generality that $\mathrm{left}(\rho) < \mathrm{left}(I) < \mathrm{right}(\rho) < \mathrm{right}(I)$. If $\mathrm{left}(I)$ does not belong to any virtual
interval in $V_{t-1}$, then in line 12 the virtual interval $\tau = \rho \cap I$ is added to $V$ and it must still be
there at time $t$ due to (P1). So, assume that $\mathrm{left}(I)$ belongs to some virtual interval $\tau \in V_{t-1}$. Since
$\tau$ intersects with $\rho$, the inductive hypothesis on (P3) implies that $\tau \subset \rho$ with a common endpoint.
In line 10, $\tau$ is replaced in $V$ by the new virtual interval $\tau' = \tau \cap I$, which, by (P1), remains in $V$
at time $t$. The interval $\tau'$ intersects with both $\rho$ and $I$, hence, by (P3) (applied to time $t$), it is
contained in both of them, having a common endpoint with each, thus $\tau' = \rho \cap \sigma$ and the property
holds.

Establishing (P5). Suppose toward a contradiction that there exist two distinct intervals $\rho, \sigma \in V_t$
such that $\rho \cap \sigma \ne \emptyset$. Assume without loss of generality that $\sigma$ was added to $V$ after $\rho$. By the
inductive hypothesis, $\sigma$ is added to $V$ in round $t$, while (P2) guarantees that $\rho \in V_{t-1}$. If $\sigma$ is added
to $V$ in line 10, then $\sigma = I \cap \tau$ for some virtual interval $\tau$ which is guaranteed to be in $V_{t-1}$ by
(P2). But then the inductive hypothesis implies that $\rho \cap \tau = \emptyset$, thus $\rho \cap \sigma = \emptyset$.

So, assume that $\sigma$ is added to $V$ in line 12. In that case $\sigma = I \cap \tau$ for some $\tau \in A_{t-1}$. Assume
without loss of generality that $\mathrm{left}(\tau) < \mathrm{left}(I) < \mathrm{right}(\tau) < \mathrm{right}(I)$ so that $\sigma = [\mathrm{left}(I), \mathrm{right}(\tau)]$
is added to $V$ for $p = \mathrm{left}(I)$. Since $\rho$ intersects with $\sigma$, it must also intersect with both $I$ and $\tau$. We
know that $p$ cannot belong to $\rho$ as otherwise, the else condition in line 11 would not have passed.
But, by the inductive hypothesis on (P3), $\rho \subset \tau$, thus $\rho \subset I$ and $I$ should have been rejected in
line 2.

Properties (P6) and (P7) can now be established based on the other properties. 

Establishing (P6). Consider some point $p \in \mathbb{R}$ and suppose toward a contradiction that there
exist three distinct intervals $\rho_1, \rho_2, \rho_3 \in A_t$ such that $p \in \rho_i$ for every $1 \le i \le 3$. By (P4), the
intersections $\sigma_{1,2} = \rho_1 \cap \rho_2$, $\sigma_{1,3} = \rho_1 \cap \rho_3$, and $\sigma_{2,3} = \rho_2 \cap \rho_3$ are all in $V_t$. But (P5) implies that
$\sigma_{1,2}$, $\sigma_{1,3}$, and $\sigma_{2,3}$ are pairwise disjoint, in contradiction to their definition.

Establishing (P7). Consider any two intervals $\rho, \sigma \in A_t$. If $\rho \cap \sigma \ne \emptyset$, then (P4) implies that
$\rho \cap \sigma \in V_t$. By (P3), $\rho \cap \sigma$ is strictly contained in both $\rho$ and $\sigma$, hence $\rho$ cannot be a subset of $\sigma$
(nor can $\sigma$ be a subset of $\rho$). □



The components. We employ Lemma 3.4 in order to understand the structure of the components 
of A and their relations with the intervals in V. To that end, fix some time t and consider an 
arbitrary component $C$ formed as the union of the actual intervals $\rho_1, \dots, \rho_k \in A_t$. We denote the
leftmost and rightmost points in (the segment) $C$ by $\mathrm{left}(C)$ and $\mathrm{right}(C)$, respectively.

Assume without loss of generality that $\mathrm{left}(\rho_i) < \mathrm{left}(\rho_{i+1})$ for every $1 \le i \le k - 1$.
Lemma 3.4 (P6) and (P7) then guarantee that

$\mathrm{left}(\rho_{i-1}) < \mathrm{left}(\rho_i) < \mathrm{right}(\rho_{i-1}) < \mathrm{left}(\rho_{i+1}) < \mathrm{right}(\rho_i) < \mathrm{right}(\rho_{i+1})$

for every $2 \le i \le k - 1$. By Lemma 3.4 (P4), we conclude that $\rho_i \cap \rho_{i+1} \in V_t$ for every $1 \le i \le k - 1$,
while Lemma 3.4 (P3) implies that the segment $[\mathrm{left}(\rho_2), \mathrm{right}(\rho_{k-1})]$ does not intersect with
any other virtual interval in $V_t$. The segment $C$ possibly contains two more virtual intervals at
time $t$: an interval $\sigma_\ell \subset [\mathrm{left}(\rho_1), \mathrm{left}(\rho_2))$ and an interval $\sigma_r \subset (\mathrm{right}(\rho_{k-1}), \mathrm{right}(\rho_k)]$, but then
Lemma 3.4 (P3) guarantees that $\mathrm{left}(\sigma_\ell) = \mathrm{left}(\rho_1) = \mathrm{left}(C)$ and $\mathrm{right}(\sigma_r) = \mathrm{right}(\rho_k) = \mathrm{right}(C)$.
An illustration of a component is provided in Figure 1. There may also exist virtual intervals
in between the components of $A$, but Lemma 3.5, to be stated soon, essentially shows that their
number and structure are fairly limited.
Figure 1: A component $C$ of $A$. The solid lines depict the actual intervals $\rho_i$, $i = 1, \dots, 5$; the
dashed lines depict the virtual intervals contained in $C$.

Let $\mathcal{C}_t$ denote the collection of the components of $A_t$. To simplify the analysis, we subsequently
assume the existence of two permanent components of $A$, one to the left of all other components
and one to their right. (We can think of these components as being located in $-\infty$ and $+\infty$,
respectively.) The two permanent components are assumed to be included in $\mathcal{C}_t$ for every $0 \le t \le n$,
so in particular, $|\mathcal{C}_t| \ge 2$.

A point $p \in \mathbb{R}$ is said to be dirty at time $t$ if there exists some interval $\rho \in V_t$ such that $p \in \rho$.
Interval $\sigma$ is called isolated at time $t$ if $\sigma \in V_t$ and $\sigma \cap \rho = \emptyset$ for every interval $\rho \in A_t$. Consider two
adjacent components $C_\ell, C_r \in \mathcal{C}_t$, where $C_\ell$ is to the left of $C_r$. We say that the pair $(C_\ell, C_r)$ is
solid at time $t$ if at most one virtual interval in $V_t$ intersects with the segment $[\mathrm{right}(C_\ell), \mathrm{left}(C_r)]$,
namely, if at most one of the following three events occurs: (a) $\mathrm{right}(C_\ell)$ is dirty at time $t$; (b)
$\mathrm{left}(C_r)$ is dirty at time $t$; or (c) there exists an isolated interval in between $C_\ell$ and $C_r$ at time $t$.
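
The three notions just defined can be made concrete with a few predicates. This is only an illustrative sketch: it assumes closed intervals stored as `(left, right)` tuples, and the helper names are hypothetical rather than part of the algorithm's specification.

```python
def intersects(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def is_dirty(p, virtual):
    """A point p is dirty if some virtual interval contains it."""
    return any(left <= p <= right for left, right in virtual)

def is_isolated(sigma, actual):
    """A virtual interval is isolated if it meets no actual interval."""
    return all(not intersects(sigma, rho) for rho in actual)

def is_solid(right_of_left_comp, left_of_right_comp, virtual):
    """Solid: at most one virtual interval meets [right(C_l), left(C_r)]."""
    gap = (right_of_left_comp, left_of_right_comp)
    return sum(intersects(sigma, gap) for sigma in virtual) <= 1
```

For instance, with the gap `[3, 7]` between two components, a single isolated virtual interval `(4, 5)` leaves the pair solid, whereas adding a virtual interval `(2, 3)` that dirties the left endpoint does not.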



Lemma 3.5 states that the pair $(C_\ell, C_r)$ is always solid.

Lemma 3.5. At every time $0 \le t \le n$, all pairs of adjacent components in $\mathcal{C}_t$ are solid.

It is important to point out that the notion of a solid pair $(C_\ell, C_r)$ of adjacent components
is stronger than merely claiming that there is at most one isolated interval between $C_\ell$ and $C_r$;
indeed, the latter allows for a scenario in which more than one virtual interval intersects with the
segment $[\mathrm{right}(C_\ell), \mathrm{left}(C_r)]$. This seemingly insignificant distinction turns out to be crucial for our
analysis (see the proof of Lemma 3.6).



Proof of Lemma 3.5. The assertion is established by induction on $t$. It clearly holds at time $t = 0$,
so assume that it holds at time $t-1$ and consider the execution of Alg in round $t$. If $I = I_t$ is
rejected in line 2, then the assertion trivially holds at time $t$ as $A_t = A_{t-1}$ and $V_t = V_{t-1}$. Therefore,
we assume hereafter that Alg reaches line 3.

Let $D_\ell$ be the rightmost component in $\mathcal{C}_{t-1}$ such that $\mathrm{right}(D_\ell) < \mathrm{left}(I)$ and let $D_r$ be the
leftmost component in $\mathcal{C}_{t-1}$ such that $\mathrm{left}(D_r) > \mathrm{right}(I)$. Fix $G = (\mathrm{right}(D_\ell), \mathrm{left}(D_r))$ and let

$$W_{t-1} = \{C \in \mathcal{C}_{t-1} \mid C \subseteq G\} \quad\text{and}\quad W_t = \{C \in \mathcal{C}_t \mid C \subseteq G\} .$$

Let $Z$ be the set of isolated intervals $\sigma$ at time $t-1$ such that $\sigma \cap I \ne \emptyset$.

Clearly, the components located outside the segment $G$ remain intact in round $t$. Each endpoint
$p$ of such a component is dirty at time $t$ if and only if it is dirty at time $t-1$. Moreover, a virtual
interval located outside $G$ is isolated at time $t$ if and only if it is isolated at time $t-1$. Therefore,
every pair of adjacent components located outside $G$ remains solid at time $t$. The lemma is proved
by considering the pairs $(C, C')$ of adjacent components such that at least one of them is located
in $G$. For that we will need the following three facts.

• $0 \le |Z| \le 1$.

To see why this is true, note that if $I$ intersects with two different isolated intervals at time
$t-1$, then by the inductive hypothesis there must exist a component $C \in W_{t-1}$ between them
such that $C \subseteq I$. But then, $I$ contains the actual intervals that form $C$ and it should have
been rejected in line 2.

• $0 \le |W_{t-1}| + |Z| \le 2$.

Indeed, by definition, $I$ intersects with every component in $W_{t-1}$ (this is not necessarily the
case for the components in $W_t$). Therefore, if $|W_{t-1}| + |Z| \ge 3$, then $I$ must contain either
a component in $W_{t-1}$ or a virtual interval in $Z$. As $I$ was not rejected in line 2, neither of
these can occur.

• $1 \le |W_t| \le 3$.

The fact that $|W_t| \ge 1$ follows directly from Proposition 3.2. To see that $|W_t| \le 3$, we note
that an interval $\rho \in A_{t-1}$ that does not intersect with $I$ must also be in $A_t$. Indeed, $\rho$ is
removed from $A$ in line 5 only if it contains $I$ and in line 15 only if it contains a virtual
interval $\sigma$ that was recently added to $V$, but this interval $\sigma$ is then contained in $I$. Now, if
$|W_t| \ge 4$, then there must exist two adjacent components $C_1, C_2 \in W_t$ such that both of them
are either to the left of $I$ or to its right. Assume without loss of generality that $C_1$ is to the
left of $C_2$ and $C_2$ is to the left of $I$. Since $I$ intersects with every component in $W_{t-1}$, $C_1$ and
$C_2$ must have been part of the same component in $W_{t-1}$. In particular, there must exist some
interval $\tau \in A_{t-1}$ that intersects with $C_1$ such that $\tau \notin A_t$. But then, $\tau \cap I \ne \emptyset$, which means
that $\tau$ contains $C_2$ and the actual intervals that form it, in contradiction to Lemma 3.4 (P7).



We are now ready to complete the proof of Lemma 3.5. Assume first that $|Z| = 1$, say $Z = \{\sigma\}$,
and that $I \cap \sigma \ne \emptyset$. If $I \subseteq \sigma$, then $W_{t-1} = \emptyset$, $W_t = \{C\}$, where $C$ is formed only from $I$, and
both $(D_\ell, C)$ and $(C, D_r)$ are solid at time $t$. This remains true if $I$ properly intersects with $\sigma$ and
$W_{t-1} = \emptyset$. Otherwise, if $I$ properly intersects with $\sigma$ and with (the unique) $C_{t-1} \in W_{t-1}$, say,
$\mathrm{left}(\sigma) < \mathrm{left}(I) < \mathrm{right}(\sigma) < \mathrm{left}(C_{t-1}) < \mathrm{right}(I) < \mathrm{right}(C_{t-1})$, then $G$ does not contain any isolated
interval at time $t$, $W_t$ consists of a single component $C_t$, $\mathrm{left}(C_t) = \mathrm{left}(I)$ is dirty at time $t$, and
$\mathrm{right}(C_t)$ is dirty at time $t$ if and only if $\mathrm{right}(C_{t-1})$ is dirty at time $t-1$. It follows by the inductive
hypothesis that both $(D_\ell, C_t)$ and $(C_t, D_r)$ are solid at time $t$.

So, we subsequently assume that $I$ does not intersect with any isolated interval at time $t-1$.
We may also assume that if $\rho \subseteq G$ is isolated at time $t$, then it is also isolated at time $t-1$. Indeed,
topology-wise, there is only one scenario that leads to the creation of a new isolated interval. In this
scenario $\rho \in V_{t-1}$, $\rho \cap I = \emptyset$, $|W_{t-1}| = 1$, say, $W_{t-1} = \{C_{t-1}\}$, and there exists an actual interval
$\sigma \in A_{t-1}$ that contains both $\rho$ and $I$, the former with a common endpoint. Assume without loss of
generality that $\mathrm{right}(\rho) < \mathrm{left}(I)$. Then, $|W_t| = 1$, say, $W_t = \{C_t\}$, $\mathrm{left}(C_t) = \mathrm{left}(I)$ is not dirty at
time $t$, and $\mathrm{right}(C_t)$ is dirty at time $t$ if and only if $\mathrm{right}(C_{t-1})$ is dirty at time $t-1$. Once again,
it follows by the inductive hypothesis that both $(D_\ell, C_t)$ and $(C_t, D_r)$ are solid at time $t$.

The lemma is now established by the following five observations. 

• Suppose that $|W_{t-1}| = 0$. Then $|W_t| = 1$, say, $W_t = \{C\}$, and neither of the endpoints of $C$
is dirty at time $t$.

• Suppose that $|W_{t-1}| = |W_t| = 1$, say, $W_{t-1} = \{C_{t-1}\}$ and $W_t = \{C_t\}$. If $\mathrm{left}(C_t)$ (respectively,
$\mathrm{right}(C_t)$) is dirty at time $t$, then $\mathrm{left}(C_{t-1})$ (resp., $\mathrm{right}(C_{t-1})$) is dirty at time $t-1$.

• Suppose that $|W_{t-1}| = 2$, say, $W_{t-1} = \{B_\ell, B_r\}$, where $B_\ell$ is to the left of $B_r$. Then $|W_t| = 1$,
say, $W_t = \{C\}$, and if $\mathrm{left}(C)$ (respectively, $\mathrm{right}(C)$) is dirty at time $t$, then $\mathrm{left}(B_\ell)$ (resp.,
$\mathrm{right}(B_r)$) is dirty at time $t-1$.

• Suppose that $|W_{t-1}| = 1$, say, $W_{t-1} = \{B\}$ and $|W_t| = 2$, say, $W_t = \{C_1, C_2\}$, where $C_1$ is to
the left of $C_2$. Then at most one of the endpoints $\mathrm{right}(C_1)$, $\mathrm{left}(C_2)$ is dirty at time $t$ and if
$\mathrm{left}(C_1)$ (respectively, $\mathrm{right}(C_2)$) is dirty at time $t$, then $\mathrm{left}(B)$ (resp., $\mathrm{right}(B)$) is dirty at
time $t-1$.

• Suppose that $|W_{t-1}| = 1$, say, $W_{t-1} = \{B\}$ and $|W_t| = 3$, say, $W_t = \{C_1, C_2, C_3\}$, indexed
from left to right. Then neither of the endpoints of $C_2$ is dirty at time $t$ and if $\mathrm{left}(C_1)$
(respectively, $\mathrm{right}(C_3)$) is dirty at time $t$, then $\mathrm{left}(B)$ (resp., $\mathrm{right}(B)$) is dirty at time $t-1$.

The assertion follows. □ 

Accounting. Consider some time $1 \le t \le n$ and let $\mathcal{C}_t = \{C_0, \dots, C_{m+1}\}$, where the $C_i$s are
indexed from left to right. Recall that components $C_0$ and $C_{m+1}$ are permanent components whose
existence is assumed only for the sake of simplifying the statement (and proof) of Lemma 3.5. In
fact, a closer examination of the proof of this lemma reveals that no virtual interval in $V_t$ intersects
with the segment $(-\infty, \mathrm{left}(C_1)]$ nor with the segment $[\mathrm{right}(C_m), +\infty)$, where $C_1$ and $C_m$ are the
leftmost and rightmost components, respectively. This leads us to the following lemma.

Lemma 3.6. $|\mathrm{Alg}(S_t)| \ge |\mathrm{Opt}(S_t)|/2$ at every time $0 \le t \le n$.



Proof. Lemma 3.3 guarantees that $|\mathrm{Opt}(S_t)| \le |\mathrm{Opt}(A_t \cup V_t)|$. As $|\mathrm{Alg}(S_t)| = |\mathrm{Opt}(A_t)|$, it is
sufficient to bound the ratio $R = |\mathrm{Opt}(A_t \cup V_t)| / |\mathrm{Opt}(A_t)|$, showing that it is at most 2. Recall that Lemma 3.5
implies that for every $1 \le i \le m-1$, at most one of the following three events occurs: (a) $\mathrm{right}(C_i)$ is
dirty; (b) $\mathrm{left}(C_{i+1})$ is dirty; or (c) there exists an isolated virtual interval in between $C_i$ and $C_{i+1}$.
Clearly, the ratio $R$ can only increase if event (c) always occurs, so we subsequently assume that
this is indeed the case. We will increase $R$ even further by assuming that there exists an isolated
virtual interval to the right of $C_m$.

Consider some component $C \in \mathcal{C}_t$ and let $A(C) = \{\rho \in A_t \mid \rho \subseteq C\}$ and $V(C) = \{\sigma \in V_t \mid \sigma \subseteq C\}$.
It is easy to verify that $|V(C)| = |A(C)| - 1$ and that $|\mathrm{Opt}(A(C))| = \lceil |A(C)|/2 \rceil$, whereas

$$|\mathrm{Opt}(A(C) \cup V(C))| = \begin{cases} |V(C)| & \text{if } V(C) \ne \emptyset \\ 1 & \text{otherwise.} \end{cases}$$

Accounting for the isolated interval to the right of $C_i$, we conclude that each component $C_i \in \mathcal{C}_t$
contributes:

(i) 1 to the denominator of $R$ and 2 to the numerator of $R$, if $V(C_i) = \emptyset$; and

(ii) $\lceil |A(C_i)|/2 \rceil$ to the denominator of $R$ and $|A(C_i)| - 1 + 1 = |A(C_i)|$ to the numerator of $R$, if
$V(C_i) \ne \emptyset$.

The assertion follows. □ 
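
To spell out the final step (a short derivation added for convenience, using only the quantities defined in the proof): in both cases the contribution of a component to the numerator is at most twice its contribution to the denominator, since

\[
2 \le 2 \cdot 1
\qquad\text{and}\qquad
|A(C_i)| \le 2 \left\lceil \frac{|A(C_i)|}{2} \right\rceil .
\]

Summing the contributions over all components under the worst-case assumptions made above therefore gives

\[
R = \frac{|\mathrm{Opt}(A_t \cup V_t)|}{|\mathrm{Opt}(A_t)|} \le 2 .
\]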

Corollary 3.7. $|\mathrm{Alg}(S)| \ge |\mathrm{Opt}(S)|/2$.

It remains to bound the space of our algorithm, showing that it is linear in the length of the bit
string representing Alg($S$). At each time $t$, the space of Alg is linear in the length of the bit strings
representing $A_t$ and $V_t$. As $|\mathrm{Opt}(S_t)|/2 \le |\mathrm{Alg}(S_t)| \le |\mathrm{Opt}(S_t)|$ for every $0 \le t \le n$, and since $|\mathrm{Opt}(S_t)|$
is non-decreasing with $t$, it is sufficient to show that $|A_t| + |V_t| = O(|\mathrm{Alg}(S_t)|) = O(|\mathrm{Opt}(A_t)|)$.

By Lemma 3.4 (P6), we know that the actual intervals in $A_t$ can be colored in two colors such that
if two intervals belong to the same color class, then they do not intersect. Thus, $|A_t| \le 2 \cdot |\mathrm{Opt}(A_t)|$ at
every time $t$. On the other hand, Lemma 3.5 implies that if we count the actual and virtual intervals
by scanning the real line from left to right, then the number of virtual intervals never exceeds that
of the actual intervals. Therefore, $|V_t| \le |A_t|$, which establishes the following corollary.

Corollary 3.8. At every time $t$, the space of Alg is linear in the length of the bit string representing
Alg($S$).
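
For reference, chaining the two bounds from the preceding paragraph with the identity $|\mathrm{Alg}(S_t)| = |\mathrm{Opt}(A_t)|$ used in the proof of Lemma 3.6 gives an explicit constant (this is just the arithmetic behind the corollary, not an additional claim):

\[
|A_t| + |V_t| \;\le\; 2\,|A_t| \;\le\; 4\,|\mathrm{Opt}(A_t)| \;=\; 4\,|\mathrm{Alg}(S_t)| .
\]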



4 Proper Intervals 

In this section we consider the interval selection problem for half-open proper intervals. There is 
an easy deterministic 2-approximate streaming (and online) algorithm that uses no extra space in 
addition to storing the output: simply greedily add an interval whenever possible. We give here a 
streaming algorithm with an improved approximation ratio of 3/2, using output-linear space. As 
we show in Sec. 5, this ratio is in fact optimal.
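
The simple greedy baseline mentioned above can be sketched as follows. This is a minimal illustration, assuming half-open intervals given as `(left, right)` tuples; the function name is hypothetical.

```python
def greedy_online(stream):
    """Keep an arriving half-open interval iff it is disjoint from all kept ones."""
    kept = []
    for left, right in stream:
        # [left, right) is disjoint from [l, r) iff it ends before l or starts at/after r.
        if all(right <= l or r <= left for l, r in kept):
            kept.append((left, right))
    return kept
```

On the unit-interval stream `[(0.5, 1.5), (0, 1), (1, 2)]` the greedy rule keeps only the first interval, while the optimum has two, which illustrates why the ratio of this baseline cannot be better than 2.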



Footnote (to the counting argument preceding Corollary 3.8): In passing, this is also shown in the proof of Lemma 3.6.
Let the support $L = \bigcup_{I \in S} I$ be the subset of the real line covered by intervals in the input.
Each maximal continuous segment in L is referred to as a (connected) component. The maximal 
segments on the line that are outside components are called out-regions. 

The algorithm maintains a partition of L into zones, which are segments on the line with the 
property that no input interval has both endpoints in the same zone. A zone is said to be empty if 
no endpoint of an interval falls in it, and non-empty otherwise. The zones are classified as either 

flexible or fixed, where only fixed zones have acquired a permanent boundary. Each component in 
the input consists of a series of fixed zones, with flexible zones on its left and right. The flexible 
zones all have one permanent endpoint, that is in common with an adjacent fixed zone, and the 
other endpoint non-permanent that can extend further into the adjacent out-region. The zones can 
be either closed, half-closed, or open segments, but each component induces a closed segment. 

The zones are updated as follows. When an interval I is received, we consider the following 
cases: 

1. If both endpoints of $I$ are outside components (i.e., $I$ intersects no previous interval), then
we form a new fixed zone $[\mathrm{left}(I), \mathrm{right}(I))$ and create a flexible zone $[\mathrm{right}(I), \mathrm{right}(I)]$.

2. If both endpoints of $I$ fall into zones in the same component, then nothing needs to be done.

3. If the endpoints of $I$ belong to zones in different components, then both of those zones are
fixed (if needed). The out-region between the respective components, along with any flexible
zones properly included in $I$, is turned into one more (possibly empty) fixed zone.

4. If one endpoint (say $x$) of $I$ falls in a zone (say $k$) and the other (say $y$) in an out-region, then
zone $k$ is fixed (if needed). Suppose without loss of generality that $y$ is the right endpoint of
$I$, and let $C$ be the component containing $x$. If a flexible zone $z$ is contained in $(x, y]$, then $z$
is extended to include $y$ (by adding the segment $(\mathrm{right}(C), y]$ into the zone $z$). Otherwise, a
new flexible zone $(\mathrm{right}(C), y]$ is created. This is the only case where a zone changes its size.

This completes the specification of the zones. Note that for the special case of unit intervals,
defining the zones to be simply $[k, k+1)$ for $k \in \mathbb{Z}$ suffices for our analysis.

Define $\mathit{bzone}(I)$ ($\mathit{fzone}(I)$) to be the zone in which $\mathrm{left}(I)$ ($\mathrm{right}(I)$) falls. Define
$D = \{\mathit{bzone}(I), \mathit{fzone}(I) : I \in S\}$ as the set of non-empty zones. Assume that the zones are dynamically
enumerated from left to right.

For each non-empty zone $k \in D$, the algorithm maintains information about two intervals: $L_k$,
the interval with the leftmost left endpoint in the zone, and $R_k$, the one with the rightmost right
endpoint in the zone. Namely, $L_k = \arg\min_{I : \mathit{bzone}(I) = k} \mathrm{left}(I)$ and $R_k = \arg\max_{I : \mathit{fzone}(I) = k} \mathrm{right}(I)$.
For convenience we write for an empty zone $k$ that $L_k = R_{k'}$, where $k' = \max\{t \in D : t < k\}$, and
$R_k = L_{k''}$, where $k'' = \min\{t \in D : t > k\}$.
The output of the algorithm is the maximum interval selection of the set $A = \bigcup_k \{L_k, R_k\}$,
obtained by the classic left-right greedy algorithm. Let ALG be the set of intervals output by the
algorithm and OPT be an optimal interval selection. For two proper intervals $I$ and $J$, we write
$I < J$ to denote that $\mathrm{left}(I) < \mathrm{left}(J)$ (and thus also $\mathrm{right}(I) < \mathrm{right}(J)$), and similarly $I \le J$ if
$\mathrm{left}(I) \le \mathrm{left}(J)$.

This completes the specification of the algorithm. 
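
For the unit-interval special case noted earlier, where the zones can be taken to be $[k, k+1)$ for integer $k$, the whole algorithm fits in a few lines. The sketch below is an illustration under that assumption only (it does not implement the flexible/fixed zone maintenance for general proper intervals); each unit interval is given by its left endpoint, and all function names are hypothetical.

```python
import math

def max_disjoint(intervals):
    """Classic greedy: scan by right endpoint, keep every interval that fits."""
    chosen, last_right = [], float("-inf")
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if left >= last_right:          # half-open intervals may touch
            chosen.append((left, right))
            last_right = right
    return chosen

def unit_interval_stream(lefts):
    """One pass over unit intervals [l, l + 1), storing only L_k and R_k per zone."""
    L = {}  # zone k -> interval with the leftmost left endpoint in zone k
    R = {}  # zone k -> interval with the rightmost right endpoint in zone k
    for l in lefts:
        iv = (l, l + 1)
        b, f = math.floor(iv[0]), math.floor(iv[1])   # bzone(I), fzone(I)
        if b not in L or iv[0] < L[b][0]:
            L[b] = iv
        if f not in R or iv[1] > R[f][1]:
            R[f] = iv
    return max_disjoint(set(L.values()) | set(R.values()))
```

On the stream of left endpoints `[0.5, 0.0, 1.0]`, which defeats the simple greedy rule, the stored set is all three intervals and the output is the optimal pair `[(0.0, 1.0), (1.0, 2.0)]`.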

We claim that the specification of the zones allows the algorithm to properly maintain the
invariant of storing $L_k$ and $R_k$. Each time a non-empty zone is created (cases 1, 4), those values
are initialized. Each time a zone is extended (case 4), say to the left, $L_k$ is updated with the newly
presented interval. The region thus incorporated into the zone was previously outside zones, so
contained no points. Hence, the claim.

Lemma 4.1. For any interval $I$ in the input, $\mathit{bzone}(I) < \mathit{fzone}(I) \le \mathit{bzone}(I) + 2$.

Proof. Observe that the zone specification maintains the invariant that each zone is properly covered
by some input interval. Namely, the zones created in cases 1, 3, and 4 are by definition properly
covered, and the same happens in the extension of flexible zones in case 4. This observation, along
with the fact that the intervals are proper, ensures that the endpoints of each interval end in
different zones, establishing the first inequality.

Consider a point at the boundary of a connected component, and let $J$ be the interval contributing
the point. We observe that $J$ must have its endpoints in adjacent zones. Further, if $z$ is fixed, then
the other endpoint of $J$ must be the nearest point in the adjacent zone.

It remains to consider case 3. By the above observation, J properly contains no fixed zones. 
Thus, after the addition of J, it will properly contain only one zone. Hence, the second inequality 
follows. □ 



We also observe, using Lemma 4.1, that the load of $A$ is constant, hence the number of intervals
stored is linear in the cardinality of the solution.

Observation 4.2. $|A| = O(|ALG|)$.

Finally, we argue the performance ratio of the algorithm. The following lemma captures the 
core of the argument. 

Lemma 4.3. Let $R$ be a collection of three disjoint input intervals. Then, $A$ contains a pair of
intervals contained in the span of $R$.

Proof. Let the three disjoint intervals be $O_x < O_y < O_z$. Our claim is that $A$ contains a pair of
disjoint intervals $I, I' \subseteq [\mathrm{left}(O_x), \mathrm{right}(O_z))$, the span of $\{O_x, O_y, O_z\}$.

Let $b_i = \mathit{bzone}(O_i)$ and $f_i = \mathit{fzone}(O_i)$, for $i = x, y, z$, and note that by Lemma 4.1,
$b_x < f_x \le b_y < f_y \le b_z < f_z$. Consider the intervals $R_{f_x}$ and $L_{b_z}$. By definition, $O_x \le R_{f_x}$ and $L_{b_z} \le O_z$, so
$R_{f_x}, L_{b_z} \subseteq [\mathrm{left}(O_x), \mathrm{right}(O_z))$. Since $f_x < b_z$, it follows that $R_{f_x}$ and $L_{b_z}$ are disjoint, establishing
the claim. □

Theorem 4.4. $|ALG| \ge 2|OPT|/3$.

Proof. Let $O_0, O_1, \dots, O_{p-1}$ be the intervals in the optimal solution in order of endpoints, where
$p = |OPT|$, and let $N = \lfloor (p-2)/3 \rfloor$. From Lemma 4.3, we obtain that for each $r = 0, \dots, N-1$, the
algorithm finds at least two intervals within the segment $[\mathrm{left}(O_{3r+1}), \mathrm{right}(O_{3r+3}))$. Additionally,
if $k_{\min}$ and $k_{\max}$ are the first and last non-empty zones, then $L_{k_{\min}} \le O_0$ and $O_{p-1} \le R_{k_{\max}}$.
Hence, $|ALG| \ge 2 + 2N = \lfloor 2(p+2)/3 \rfloor \ge (2p+2)/3$. □



5 Lower Bound(s) 

In this section we establish lower bounds on the approximation ratio of randomized streaming 
algorithms for the interval selection problem, establishing the following two theorems. 
Theorem 5.1 (Lower bound for general intervals). For every real $\epsilon > 0$, integers $k_0, n_0 > 0$, and
subexponential (respectively, sublinear) function $s : \mathbb{N} \to \mathbb{N}$, there exist $k_0 \le k \le c \cdot k_0$, where $c$ is a
universal constant, $n \ge n_0$, and an interval stream $S$ such that

(1) $|S| = n$;

(2) $|\mathrm{Opt}(S)| = k$; and

(3) $\mathrm{Alg}(S) < k(1/2 + \epsilon)$ for any randomized interval selection streaming algorithm Alg with space
$s(kb)$ (resp., space $s(nb)$), where $b$ is the length of the bit strings representing the endpoints.

Theorem 5.2 (Lower bound for unit intervals). For every real $\epsilon > 0$, integers $k, n_0 > 0$, and
subexponential (respectively, sublinear) function $s : \mathbb{N} \to \mathbb{N}$, there exist $n \ge n_0$ and a unit interval
stream $S$ such that

(1) $|S| = n$;

(2) $|\mathrm{Opt}(S)| = k$; and

(3) $\mathrm{Alg}(S) < k(2/3 + \epsilon)$ for any randomized proper interval selection streaming algorithm Alg with
space $s(kb)$ (resp., space $s(nb)$), where $b$ is the length of the bit strings representing the endpoints.

Our lower bounds are proved by designing a random interval stream S for which every determin- 
istic algorithm performs badly on expectation; the assertion then follows by Yao's principle. (Our 
construction uses half-open intervals, but this can be easily altered.) Note that under the setting 
used by our lower bounds, the algorithm is required to output a collection C of disjoint intervals, 
and the quality of the solution is then determined to be the cardinality of $C \cap S$. In other words,
the algorithm is allowed to output non-existing intervals (that is, intervals that never arrived in 
the input), but it will not be credited for them. This, obviously, can only increase the power of the 
algorithm. 

The $(k, n)$-gadget. Fix some positive integer $m$ whose role is to bound the space of the algorithm.
Our lower bounds rely on the following framework, characterized by the parameters $k, n \in \mathbb{Z}_{>0}$,
denoted a $(k, n)$-gadget. Consider an extensive form two-player zero-sum game played between
the algorithm (MAX) and the adversary (MIN), depicted by a sequence of $k$ phases. Informally,
in each phase $t$, the adversary chooses a permutation $\pi_t \in P_n$, where $P_n$ is the collection of all
permutations on $n$ elements, and an index $i_t \in [n]$. The algorithm observes $\pi_t$ (but not $i_t$) and
produces a memory image $M_t$, i.e., a bit string of length $m$. The index $i_t$ is handed to the algorithm
after the memory image is produced. At the end of the last phase the algorithm tries to recover
$\pi_t(i_t)$ for $t = 1, \dots, k$: it outputs some $\hat{\imath}_t \in [n]$ based on the memory image $M_t$, index $i_t$, and all
other memory images and indices. For each $t$ such that $\hat{\imath}_t = \pi_t(i_t)$, the algorithm scores a (positive)
point.

More formally, the adversarial strategy is depicted by the choices of the permutations $\pi_t$ and
the indices $i_t$ for $t = 1, \dots, k$. We commit the adversary to make those choices uniformly at random
(so, the adversary reveals its mixed strategy), namely, $\pi_t \in_r P_n$ and $i_t \in_r [n]$ for every $t$, where
all the random choices are independent. The strategy of the algorithm is depicted by the function
sequences $\{f_t\}_{t=1}^{k}$ and $\{g_t\}_{t=1}^{k}$, where

$$f_t : P_n \times (\{0,1\}^m \times [n])^{t-1} \to \{0,1\}^m \quad\text{and}\quad g_t : \{0,1\}^m \times [n] \times (\{0,1\}^m \times [n])^{k-1} \to [n] .$$

Let $T_0$ be the empty string and recursively define $T_t = T_{t-1} \circ f_t(\pi_t, T_{t-1}) \circ i_t$. The payoff of the
algorithm is the number of $t$s, $1 \le t \le k$, such that

$$g_t\big(f_t(\pi_t, T_{t-1}),\, i_t,\, \{f_{t'}(\pi_{t'}, T_{t'-1}),\, i_{t'}\}_{t' \ne t}\big) = \pi_t(i_t) .$$

In the language of the aforementioned informal description, the role of the function $f_t$ is to
produce the memory image $M_t$ based on the permutation $\pi_t$ and all previous memory images and
indices (whose concatenation is given by $T_{t-1}$). The role of the function $g_t$ is to recover $\pi_t(i_t)$ based
on the memory image $M_t$, index $i_t$, and all other memory images and indices.

Note that the memory images $M_{t'}$ and indices $i_{t'}$, $t' \ne t$, do not contain any information on the
permutation $\pi_t$ on top of that contained in $M_t$. In particular, the entropy in $\pi_t(i_t)$ given $M_t$, $i_t$,
and $\{M_{t'}, i_{t'}\}_{t' \ne t}$ is equal to the entropy in $\pi_t(i_t)$ given $M_t$ and $i_t$. Therefore, it will be convenient
to decompose the domain of the function $g_t : \{0,1\}^m \times [n] \times (\{0,1\}^m \times [n])^{k-1} \to [n]$ so that the
$(\{0,1\}^m \times [n])^{k-1}$-part determines which function $g_t : \{0,1\}^m \times [n] \to [n]$ is chosen, and then this
function $g_t$ is used to produce $\hat{\imath}_t$ based on $M_t$ and $i_t$. Similarly, we decompose the domain of the
function $f_t : P_n \times (\{0,1\}^m \times [n])^{t-1} \to \{0,1\}^m$ so that the $(\{0,1\}^m \times [n])^{t-1}$-part determines which
function $f_t : P_n \to \{0,1\}^m$ is chosen, and then this function $f_t$ is used to produce $M_t$ based on $\pi_t$.

We now turn to bound the expected payoff of the algorithm as a function of k, m, and n. The 
key ingredient in this context is the following lemma, which is essentially a well known fact in 
slightly different settings; a proof is provided in Appendix B for completeness.

Footnote (on the definition of $T_t$): We use the notation $u \circ v$ to denote the concatenation of the string $u$ to string $v$.
Figure 2: The relative locations of the intervals in an $(n, \pi)$-stack for $n = 4$. The left and right
endpoints of interval $J_i$ are located in the segments depicted by the bidirectional arrows whose
length is $\lambda/2$. The exact location within this segment is determined by $\pi(i)$. In the construction
of the 2-lower bound for general intervals, the bold rectangles correspond to the segments in which
the stacks (or auxiliary intervals) identified with the left and right children of the current node are
deployed assuming that the good interval is interval $J_2$ (these segments do not intersect with the
segments corresponding to the bidirectional arrows).

Lemma 5.3. For every real $\alpha > 0$ and integer $n_0 > 0$, there exists an integer $n \ge n_0$ such that
for every two functions $f : P_n \to \{0,1\}^m$ and $g : \{0,1\}^m \times [n] \to [n]$, where $m = \alpha n \log n$, we have
$\Pr_{\pi \in_r P_n,\, i \in_r [n]}\big(g(f(\pi), i) = \pi(i)\big) < 2\alpha$.

Corollary 5.4. For every real $\alpha > 0$ and integers $k, n_0 > 0$, there exists an integer $n \ge n_0$ such
that if $m \le \alpha n \log n$, then the expected payoff of the algorithm player in a $(k, n)$-gadget is smaller
than $2\alpha k$.

The $(n, \pi)$-stack. We now turn to implement a $(k, n)$-gadget via a carefully designed interval
stream. As a first step, we introduce the $(n, \pi)$-stack construction. Given an integer $n > 0$ and
a permutation $\pi \in P_n$, an $(n, \pi)$-stack deployed in the segment $[x, y)$, $x < y$, is a collection of $n$
intervals $J_1, \dots, J_n$ satisfying:

(1) all intervals $J_i$ are half open;

(2) all intervals $J_i$ have the same length $\mathrm{right}(J_i) - \mathrm{left}(J_i) = \lambda n$, where $\lambda = \frac{y - x}{2n - 1/2}$; and

(3) $\mathrm{left}(J_i) = x + \lambda(i - 1) + \epsilon \pi(i)$ for every $i \in [n]$, where $\epsilon = \lambda/(2n)$.

Note that this deployment ensures that $\mathrm{left}(J_n) < \mathrm{right}(J_1)$, hence the half open segment
$[\mathrm{left}(J_n), \mathrm{right}(J_1))$ is contained in $J_i$ for every $i \in [n]$. Moreover, the union of the intervals in
the stack does not necessarily cover the whole segment $[x, y)$; it is always contained in $[x, y)$,
though. The structure of an $(n, \pi)$-stack is illustrated in Figure 2.
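
The stack construction is easy to reproduce with exact rational arithmetic. The following sketch builds an $(n, \pi)$-stack from conditions (1)-(3) and shows that the left endpoints encode $\pi$; the function names and the list encoding of $\pi$ (with `pi[i-1]` holding $\pi(i)$) are choices made for this illustration only.

```python
from fractions import Fraction

def build_stack(n, pi, x, y):
    """Return the half-open intervals J_1..J_n of an (n, pi)-stack deployed in [x, y)."""
    lam = Fraction(y - x) / (2 * n - Fraction(1, 2))   # lambda = (y - x) / (2n - 1/2)
    eps = lam / (2 * n)                                # epsilon = lambda / (2n)
    stack = []
    for i in range(1, n + 1):
        left = x + lam * (i - 1) + eps * pi[i - 1]
        stack.append((left, left + lam * n))           # every J_i has length lambda * n
    return stack

def decode_pi(stack, x, y):
    """Recover pi from the exact left endpoints of the stack."""
    n = len(stack)
    lam = Fraction(y - x) / (2 * n - Fraction(1, 2))
    eps = lam / (2 * n)
    return [int((left - x - lam * i) / eps) for i, (left, _) in enumerate(stack)]
```

For example, `build_stack(4, [2, 4, 1, 3], 0, 1)` yields four intervals of length 8/15 that all lie inside `[0, 1)` and share the segment `[left(J_4), right(J_1))`, and `decode_pi` returns `[2, 4, 1, 3]` again.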

The $(k, n)$-gadget is implemented by introducing $k$ stacks, each corresponding to one phase,
and some auxiliary intervals; the stack corresponding to phase $t$ is referred to as stack $t$. The
permutation $\pi$ used in the construction of stack $t$ is $\pi_t$. The index $i_t$ will dictate the choice of one
good interval out of the $n$ intervals in that stack. What exactly makes this interval good will be
clarified soon; informally, the algorithm has no incentive to output an interval in a stack unless this
interval is good.

The $k$ stacks are used both by the construction of the 2-lower bound for general interval streams
and by that of the (3/2)-lower bound for unit intervals. The difference between the two constructions
lies in the manner in which these stacks are deployed in the real line, and in the addition of
the auxiliary intervals.



A (3/2)-lower bound for unit intervals. The interval stream that realizes the $(k, n)$-gadget
for the (3/2)-lower bound for unit intervals is constructed as follows. It contains $k$ sufficiently
spaced apart stacks, where the intervals in each stack are scaled to a unit length (so $\lambda = 1/n$).
Consider stack $t$ and suppose that it is deployed in the segment $[x, y)$, where $y = x + 2 - 1/(2n)$.
Recall that the permutation that determines the exact location of the intervals in the stack is $\pi_t$
and that the good interval is $J_{i_t}$.

After the arrival of the $n$ intervals in the stack, two more half open unit auxiliary intervals are
presented:

$$L_t = \left[x + \frac{i_t - 1}{n} - 1,\; x + \frac{i_t - 1}{n}\right) \quad\text{and}\quad R_t = \left[x + \frac{i_t - 1/2}{n} + 1,\; x + \frac{i_t - 1/2}{n} + 2\right) .$$

In other words, the interval $L_t$ (respectively, $R_t$) is located to the left (resp., right) of the leftmost
(resp., rightmost) point in which $\mathrm{left}(J_{i_t})$ (resp., $\mathrm{right}(J_{i_t})$) may be deployed. It is easy to verify
that except for the good interval $J_{i_t}$, which does not intersect with $L_t$ and $R_t$, every interval in the
stack intersects with exactly one of these two auxiliary intervals.
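
The "easy to verify" claim can also be checked mechanically. The sketch below builds a unit-length stack together with $L_t$ and $R_t$ as defined above and counts, for each $J_i$, how many of the two auxiliary intervals it meets. The function names and the encoding of $\pi_t$ as a list are assumptions of this illustration.

```python
from fractions import Fraction

def intersects(a, b):
    """True if the half-open intervals [a0, a1) and [b0, b1) share a point."""
    return a[0] < b[1] and b[0] < a[1]

def unit_stack_with_aux(n, pi, i_t, x):
    """Unit-length (n, pi)-stack starting at x, plus the auxiliary intervals L_t, R_t."""
    lam = Fraction(1, n)
    eps = lam / (2 * n)
    stack = []
    for i in range(1, n + 1):
        left = x + lam * (i - 1) + eps * pi[i - 1]
        stack.append((left, left + 1))
    a = x + Fraction(i_t - 1, n)              # x + (i_t - 1)/n
    b = x + Fraction(2 * i_t - 1, 2 * n)      # x + (i_t - 1/2)/n
    return stack, (a - 1, a), (b + 1, b + 2)
```

With `n = 4`, `pi = [2, 4, 1, 3]` and `i_t = 3`, the counts are `[1, 1, 0, 1]`: only the good interval $J_3$ avoids both auxiliary intervals.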

The best response of the algorithm would be to output the two auxiliary intervals and to try
to recover the good interval $J_{i_t}$. (Note that the payoff guaranteed by this strategy is at least 2
per stack, whereas any other strategy yields a payoff of at most 2 per stack.) For that purpose,
the algorithm has to recover the exact locations of the endpoints of $J_{i_t}$ that implicitly encode
$\pi_t(i_t)$. Observing that the endpoints in this construction can be represented by bit strings of length
$\log(n) + \log(k)$, Theorem 5.2 follows by Corollary 5.4.



A 2-lower bound for general intervals. The interval stream that realizes the $(k, n)$-gadget for
the 2-lower bound for general intervals is constructed as follows. Assume that $k = 2^{\kappa} - 1$ for some
positive integer $\kappa$ and consider a perfect binary tree $T$ of depth $\kappa$. The $k$ stacks are identified with
the internal nodes of $T$ so that stack $t$ precedes stack $t + 1$ in a pre-order traversal of $T$. (In other
words, if stack $t$ is identified with node $u$ and stack $t'$ is identified with a child of $u$, then $t < t'$.)
In addition to the intervals in the stacks, we also introduce $2^{\kappa} = k + 1$ auxiliary intervals which are
identified with the leaves of $T$; these auxiliary intervals arrive last in the stream. We say that an
interval $J$ is assigned to node $u$ if $J$ belongs to the stack identified with $u$ or if $u$ is a leaf and
$J$ is the auxiliary interval identified with it.

The deployment of the stacks and the auxiliary intervals in $\mathbb{R}$ is performed as follows. Stack
1 (identified with $T$'s root) is deployed in $[0, 1)$. Given the deployment of stack $t$ identified with
internal node $u \in T$ in the segment $[x, y)$, we deploy the stacks identified with the left and right
children of $u$ in the segments

$$\sigma_\ell = [x + \lambda(i_t - 3/2),\; x + \lambda(i_t - 1)) \quad\text{and}\quad \sigma_r = [x + \lambda(i_t + n - 1/2),\; x + \lambda(i_t + n)) ,$$

respectively, where recall that $\lambda = \frac{y - x}{2n - 1/2}$. If the children of $u$ are leaves in $T$, then we deploy
auxiliary intervals in those two segments instead of stacks, that is, one auxiliary interval in $\sigma_\ell$ and
one in $\sigma_r$. Refer to Figure 2 for illustration.

The key observation regarding the choice of $\sigma_\ell$ and $\sigma_r$ is that

$$\mathrm{left}(J_{i_t - 1}) \le \mathrm{left}(\sigma_\ell) < \mathrm{right}(\sigma_\ell) < \mathrm{left}(J_{i_t}) \quad\text{and}$$
$$\mathrm{right}(J_{i_t}) \le \mathrm{left}(\sigma_r) < \mathrm{right}(\sigma_r) < \mathrm{right}(J_{i_t + 1}) .$$

In particular, this implies that: (1) the good interval in the stack identified with node $u \in T$ does
not intersect with any interval assigned to a descendant of $u$ in $T$; and (2) a non-good interval in
the stack identified with node $u \in T$ contains every interval assigned to a descendant of either the
left child of $u$ or the right child of $u$ in $T$.
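
The two implications can be checked on a small instance. The sketch below constructs a stack in $[x, y)$ together with the child segments $\sigma_\ell$ and $\sigma_r$ from the formulas above; the helper names are hypothetical and the code covers a single tree node only, not the full recursive deployment.

```python
from fractions import Fraction

def intersects(a, b):
    """True if the half-open intervals a and b share a point."""
    return a[0] < b[1] and b[0] < a[1]

def contains(outer, inner):
    """True if the half-open interval inner lies inside outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def stack_and_child_segments(n, pi, i_t, x, y):
    """Stack deployed in [x, y) and the segments reserved for the two children."""
    lam = Fraction(y - x) / (2 * n - Fraction(1, 2))
    eps = lam / (2 * n)
    stack = []
    for i in range(1, n + 1):
        left = x + lam * (i - 1) + eps * pi[i - 1]
        stack.append((left, left + lam * n))
    sigma_l = (x + lam * (i_t - Fraction(3, 2)), x + lam * (i_t - 1))
    sigma_r = (x + lam * (i_t + n - Fraction(1, 2)), x + lam * (i_t + n))
    return stack, sigma_l, sigma_r
```

For every choice of the good index, the good interval is disjoint from both child segments, each $J_i$ with $i < i_t$ contains $\sigma_\ell$, and each $J_i$ with $i > i_t$ contains $\sigma_r$.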

The best response of the algorithm would clearly include all the auxiliary intervals in the output,
hence it can include an interval $J_i$ of stack $t$ in the output only if it is the good interval of that
stack, namely, $i = i_t$. For that purpose, the algorithm has to recover the exact locations of the
endpoints of $J_{i_t}$ that implicitly encode $\pi_t(i_t)$. Observing that the endpoints in this construction
can be represented by bit strings of length $\log(n) \cdot \log(k)$, Theorem 5.1 follows by Corollary 5.4.



6 Multiple-Pass Algorithms 

We now extend the streaming algorithms to use multiple passes through the data. First, some
notation. For an interval $I$, let $\mathit{next}(I)$ be the interval in the input that ends earliest among
those that start after $I$ ends, and let $\mathit{prev}(I)$ be the interval that starts latest among those that
finish before $I$ starts. We use the notation $\mathit{next}^i(I)$ defined recursively as $I$ when $i = 0$ and as
$\mathit{next}(\mathit{next}^{i-1}(I))$ for $i > 0$, and define $\mathit{prev}^i(I)$ similarly. Observe that if $I$ is available before a
pass, then a streaming algorithm can easily compute $\mathit{next}(I)$ and $\mathit{prev}(I)$ by the end of the pass,
while maintaining $O(1)$ intervals in the memory at all times.
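
A single pass that computes both quantities with constant memory might look as follows. This is a sketch under the assumption of half-open intervals stored as `(left, right)` tuples, so that "starts after $I$ ends" is read as `left >= right(I)`; the function name is hypothetical.

```python
def next_and_prev(I, stream):
    """One pass over the stream, keeping only two candidate intervals in memory."""
    nxt = prv = None
    for J in stream:
        # next(I): earliest right endpoint among intervals starting after I ends.
        if J[0] >= I[1] and (nxt is None or J[1] < nxt[1]):
            nxt = J
        # prev(I): latest left endpoint among intervals finishing before I starts.
        if J[1] <= I[0] and (prv is None or J[0] > prv[0]):
            prv = J
    return nxt, prv
```

For `I = (4, 6)` and the stream `[(0, 2), (1, 3), (7, 10), (6, 8), (5, 9), (3, 4)]`, the pass returns `((6, 8), (3, 4))`.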

The multi-pass algorithm runs as follows. The first pass consists of the earlier one-pass algorithm,
either the algorithm of Sec. 3 for general intervals, or the algorithm of Sec. 4 for
proper intervals. The result of this pass is the set $A$, whichever base algorithm is used. Let $N_0 =
P_0 = A$. In round $p > 1$, the algorithm inductively computes $N_{p-1} = \{\mathit{next}(I) : I \in N_{p-2}\}$ and
$P_{p-1} = \{\mathit{prev}(I) : I \in P_{p-2}\}$. Let $A_p = \bigcup_{i=0}^{p-1} (N_i \cup P_i) = \{\mathit{next}^i(I), \mathit{prev}^i(I) : I \in A, 0 \le i \le p-1\}$
denote the combined set of intervals stored after pass $p$. When requested, the algorithm produces
as output the maximum interval selection in $A_p$. This completes the specification of the algorithm.

We first observe that $|A_p| \le (2p-1)|A|$, hence the space used in pass $p$ is at most a factor of $2p-1$ larger
than the length of the bit string representing $A$.

Define the span of a set R of intervals to be the segment given by the leftmost and rightmost 
points in intervals in R. 
Lemma 6.1. Given an input of general intervals, the set $A$ computed by the algorithm Alg of
Sec. 3 satisfies the following property: for any pair of disjoint intervals $I_1$ and $I_2$ in the input, $A$
contains an interval within the span of $\{I_1, I_2\}$ (given by $[\mathrm{left}(I_1), \mathrm{right}(I_2))$, assuming $I_1 < I_2$).

The following lemmas apply both to general or proper intervals. An interval is said to be 
end-simplicial if it contains either the leftmost right endpoint or the rightmost left endpoint of its 
connected component. 

Lemma 6.2. The set A contains all the end-simplicial intervals in the input. 



Proof. Regarding general intervals, recall from Proposition 3.1 that virtual intervals in Alg are 
formed by the intersection of two intervals in the input. Thus, if I is end-simplicial, it contains no 
virtual interval, and certainly no actual intervals. Hence, I is admitted to A and never rejected. 
For proper intervals, an end-simplicial interval on the left (right) will always represent $R_k$ ($L_k$) for
its finishing (beginning) zone $k$. Thus, it is contained in $A$. □

Lemma 6.3. Let $I$ be an interval in $A$ and $s \le p - 1$. Let $R$ be a set of $s + 1$ disjoint intervals,
including $I$. Then, $A_p$ contains a set of $s + 1$ intervals within the span of $R$.

Proof. Suppose $R$ contains intervals $I_1, I_2, \dots, I_s$ with $I < I_1 < I_2 < \dots < I_s$. (The case with intervals
on the left of $I$ is symmetric.) By definition, the intervals $\mathit{next}^i(I)$, $0 \le i \le s$, are disjoint and
contained in $A_{s+1}$ and thus also in $A_p$. Also, by induction, $\mathit{next}^i(I) \le I_i$, for $i = 1, \dots, s$, and thus
they fall within the span of $R$. □

Lemma 6.4. Consider any set $R$ of $m$ disjoint intervals in $S$, where $m = 2p$ for general intervals
and $m = 2p+1$ for proper intervals. Then, $A_p$ contains $m-1$ intervals within the span of $R$.



Proof. Follows from Lemma 6.3, along with Lemma 6.1 (Lemma 4.3) for general (proper) intervals,
respectively. □

Theorem 6.5. The multi-pass algorithm finds a solution for interval selection of general intervals
that is $\left(1 + \frac{1}{2p-1}\right)$-approximate after each pass $p$. On proper intervals it is $\left(1 + \frac{1}{2p}\right)$-approximate. The
space used is $O(p)$ times the size of the output.

Proof. Define $m = 2p$ for general intervals and $m = 2p+1$ for proper intervals. Consider an optimal
interval selection with intervals $I_1, \ldots, I_{|OPT|}$, where $a = |OPT|$. Let $r = a \bmod m$, and $q =
\lfloor a/m \rfloor$. Also let $t = \lceil r/2 \rceil$ and $t' = \lfloor r/2 \rfloor$. For each $R_i = \{I_{t+1+im}, \ldots, I_{t+(i+1)m}\}$, where $i =
0, \ldots, q-1$, it holds by Lemma 6.4 that $A_p$ contains $m-1$ intervals within the span of $R_i$. By
Lemmas 6.2 and 6.3, $A_p$ also contains $t$ disjoint intervals within the span of $[\mathrm{left}(S), \mathrm{right}(I_t)]$
and $t'$ disjoint intervals within the span of $[\mathrm{left}(I_{a-t'+1}), \mathrm{right}(S)]$. Hence, $A_p$ contains at least
$q(m-1) + t + t' = a - q \ge a(m-1)/m$ disjoint intervals, which gives the approximation ratio $m/(m-1) = 1 + \frac{1}{m-1}$. □
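The counting step of this proof can be checked numerically. The following is a minimal sketch, not part of the paper's algorithm: the hypothetical helper `guaranteed_size` takes $|OPT|$, the number of passes $p$, and a flag for proper intervals, and recomputes the lower bound $q(m-1) + t + t'$.

```python
def guaranteed_size(opt_size, passes, proper=False):
    """Lower bound q(m-1) + t + t' on the number of disjoint intervals in A_p."""
    m = 2 * passes + 1 if proper else 2 * passes
    q, r = divmod(opt_size, m)            # q = floor(a/m), r = a mod m
    t, t_prime = (r + 1) // 2, r // 2     # ceil(r/2) and floor(r/2)
    total = q * (m - 1) + t + t_prime
    assert total == opt_size - q          # the identity used in the proof
    assert total * m >= opt_size * (m - 1)
    return total
```

With $|OPT| = 10$, one pass on general intervals guarantees 5 intervals (ratio 2), two passes guarantee 8, and one pass on proper intervals guarantees 7.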

7 Online Algorithm



In this section we briefly show how to use the streaming algorithm presented in Sec. 3 to derive
a randomized preemptive online interval selection algorithm. Our algorithm is 6-competitive and
on top of maintaining at any time the set of currently accepted intervals A*, its only additional
memory is an interval set of cardinality linear in the size of the current optimum. We thus answer
an open question of Adler and Azar [1] about the space complexity of randomized preemptive online
algorithms for our problem.

Recall that our streaming algorithm maintains a set $A$ of intervals. With respect to that set,
our algorithm is a deterministic preemptive online algorithm, adding an interval to $A$ only when
that interval arrives, and possibly preempting it later. By Corollary 3.7, the cardinality of the
set $A$ is at least half the cardinality of the optimal solution of the input seen so far. Moreover,
Lemma 3.4 (P6) guarantees that every interval added to $A$ intersects with at most 2 previous
intervals in $A$. Therefore, $A$ is online 3-colorable: upon addition into $A$, each interval can be
assigned one of three colors, such that intersecting intervals always have different colors.

Our preemptive algorithm is now simple. We initially pick a color $c$ in $\{1, 2, 3\}$ uniformly at random. We then
run the streaming algorithm on each received interval $I$, adding $I$ to $A$, and preempting intervals
from $A$ as does the streaming algorithm. If $I$ is added to $A$ we assign it a valid color from $\{1, 2, 3\}$
in a first-fit manner. Our solution ALG consists of every interval $J$ in $A$ whose color is $c$. Clearly,
$\mathbb{E}[|\mathrm{ALG}|] = |A|/3 \ge |OPT|/6$, that is, the algorithm is 6-competitive.
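The coloring layer can be sketched as follows. This is an illustration only: the decisions of the streaming algorithm (which intervals enter $A$ and which are preempted) are assumed to be supplied by the caller through the hypothetical methods `add` and `preempt`, and intervals are closed `(left, right)` tuples. The class name `ColoredSelection` is likewise an assumption.

```python
import random


def intersects(a, b):
    """Closed intervals (left, right) intersect iff neither lies strictly to one side."""
    return a[0] <= b[1] and b[0] <= a[1]


class ColoredSelection:
    """First-fit 3-coloring of the set A, with one color class as the output."""

    def __init__(self, rng=None):
        rng = rng or random.Random()
        self.chosen_color = rng.choice((1, 2, 3))   # the random color c
        self.colors = {}                            # interval in A -> color

    def add(self, interval):
        """Called when the streaming algorithm adds `interval` to A."""
        used = {c for other, c in self.colors.items() if intersects(interval, other)}
        free = [c for c in (1, 2, 3) if c not in used]
        if not free:
            # Cannot happen if each new interval meets at most 2 intervals of A.
            raise ValueError("interval intersects three differently colored intervals")
        self.colors[interval] = free[0]             # first fit

    def preempt(self, interval):
        """Called when the streaming algorithm removes `interval` from A."""
        self.colors.pop(interval, None)

    def solution(self):
        """The current output ALG: all intervals of A colored c."""
        return [i for i, c in self.colors.items() if c == self.chosen_color]
```

Because an interval is colored against every intersecting interval already in $A$, two intervals of the same color never intersect, so each color class is a feasible selection.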

APPENDIX



A Lifting the Distinct Endpoints Assumption 

Recall that our analysis assumes that all the intervals in the stream $S$ are closed and that their
endpoints are distinct. In this section we show that these assumptions can be lifted. A quick glance
at our algorithm reveals that it is essentially comparison-based, namely, it can be implemented via
a comparison oracle $C : \mathbb{R} \times \mathbb{R} \to \{-1, 0, +1\}$ without accessing the intervals' endpoints in any other
way; given two endpoints $p, q$ of intervals in $S$, the comparison oracle returns

$$C(p, q) = \begin{cases} -1 & \text{if } p < q \\ 0 & \text{if } p = q \\ +1 & \text{if } p > q . \end{cases}$$

The assumption that all endpoints are distinct means that the algorithm and its analysis rely on a
comparison oracle $C' : \mathbb{R} \times \mathbb{R} \to \{-1, 0, +1\}$ with the additional guarantee that $C'(p, q) \ne 0$ whenever
$p \ne q$. We shall refer to such a comparison oracle $C'$ as a distinct-endpoints comparison oracle.

We show that for every stream $S$ of intervals (the endpoints of these intervals may be arbitrarily
open or closed) associated with a comparison oracle $C$, there exists a distinct-endpoints comparison
oracle $C'$ such that for every two intervals $I, J \in S$, the closure of $I$ and the closure of $J$ intersect
under $C'$ if and only if $I$ and $J$ intersect under $C$. Moreover, given access to the comparison oracle
$C$, the distinct-endpoints comparison oracle $C'$ can be implemented under our streaming model's
space requirements.

The distinct-endpoints comparison oracle $C'$ is designed as follows. Consider an endpoint $p$ of
an interval $I \in S$ and an endpoint $q$ of an interval $J \in S$, $I \ne J$. If $C(p, q) \ne 0$, then we set
$C'(p, q) = C(p, q)$, so assume hereafter that $C(p, q) = 0$. Consider first the case in which $p$ is a right
endpoint and $q$ is a left endpoint (the converse case is analogous). If at least one of the endpoints
is open, then set $C'(p, q) = -1$; otherwise (both endpoints are closed), set $C'(p, q) = +1$.

Now, consider the case in which both $p$ and $q$ are left endpoints (the converse case is analogous).
If $p$ is open and $q$ is closed, then set $C'(p, q) = +1$; if $p$ is closed and $q$ is open, then set $C'(p, q) = -1$;
if both $p$ and $q$ are open or both are closed, then we set

$$C'(p, q) = \begin{cases} +1 & \text{if } I \text{ (the interval of } p\text{) arrived before } J \text{ (the interval of } q\text{)} \\ -1 & \text{if } I \text{ (the interval of } p\text{) arrived after } J \text{ (the interval of } q\text{).} \end{cases}$$

It is easy to verify that the closures of every two intervals intersect under $C'$ if and only if the
intervals themselves intersect under $C$. Therefore, it remains to show that $C'$ can be implemented in
the streaming model. Apart from an access to the original comparison oracle $C$, the implementation
of $C'(p, q)$ is based on: (1) knowing for each endpoint whether it is a left endpoint or a right endpoint;
(2) knowing for each endpoint whether it is open or closed; and (3) knowing the order of arrival
of intervals that share a left (respectively, right) endpoint. The first two requirements are clearly
satisfied by the information provided in the input. For the third requirement, we note that if two
intervals share a left (resp., right) endpoint $p$, then they must intersect. Thus, Lemma 3.4 (P5)
and Lemma 3.4 (P6) guarantee that at any given time, our algorithm maintains $O(1)$ intervals that
have $p$ as their left (resp., right) endpoint. A data structure that tracks the arrival order of these
intervals can therefore be implemented with $O(1)$ additional bits per interval.
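The case analysis defining the distinct-endpoints oracle can be written out as a short sketch. The `Endpoint` record and the function names are hypothetical. The text only says that the case of two right endpoints is "analogous", so the mirrored rule used here for that case is an assumption rather than the paper's statement. Both endpoints are assumed to belong to different intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    value: float     # position on the line
    is_left: bool    # True for a left endpoint, False for a right endpoint
    is_open: bool    # True if the endpoint is open
    arrival: int     # arrival index of the interval owning this endpoint


def base_compare(p, q):
    """The original oracle C: compares positions only."""
    return (p.value > q.value) - (p.value < q.value)


def distinct_compare(p, q):
    """Sketch of C' for endpoints p, q of two different intervals."""
    c = base_compare(p, q)
    if c != 0:
        return c
    if not p.is_left and q.is_left:
        # Right endpoint against left endpoint at the same position: the
        # intervals touch only if both endpoints are closed.
        return -1 if (p.is_open or q.is_open) else +1
    if p.is_left and not q.is_left:
        return -distinct_compare(q, p)          # converse case, by antisymmetry
    if p.is_left:                               # both are left endpoints
        if p.is_open != q.is_open:
            return +1 if p.is_open else -1
        return +1 if p.arrival < q.arrival else -1
    # Both are right endpoints: mirrored rule (an assumption, see lead-in).
    if p.is_open != q.is_open:
        return -1 if p.is_open else +1
    return -1 if p.arrival < q.arrival else +1
```

For instance, the right endpoint of $[0, 1]$ compares as larger than the left endpoint of $[1, 2]$, so their closures still intersect, whereas the right endpoint of $[0, 1)$ compares as smaller.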



B Proof of Lemma 5.3 



Let $n$ be sufficiently large so that $n(1 + \log(e)) < \alpha n \log(n)$. Suppose toward a contradiction that
there exist two functions $f : P_n \to \{0,1\}^m$ and $g : \{0,1\}^m \times [n] \to [n]$ such that $\mathbb{P}(g(f(\pi), i) =
\pi(i)) > 2\alpha$. We shall use these functions to construct a uniquely decodable coding scheme $s : P_n \to
\{0,1\}^*$ so that $\mathbb{E}_{\pi \in_r P_n}[|s(\pi)|] < \log(n!)$. This contradicts Shannon's source coding theorem as the
entropy of choosing $\pi$ uniformly at random from $P_n$ is $\log(n!)$.

In order to construct the coding scheme, we first define the vector $v_\pi \in \{0, 1\}^n$ for every $\pi \in P_n$
by setting $v_\pi(i) = 1$ if $g(f(\pi), i) = \pi(i)$; and $v_\pi(i) = 0$ otherwise. Let $W_\pi = \{i \in [n] \mid v_\pi(i) = 0\}$.
The coding scheme $s$ is now defined by setting the codeword of each $\pi \in P_n$ to be

$$s(\pi) = v_\pi \circ f(\pi) \circ \bigcirc_{i \in W_\pi} \pi(i) ,$$

where $\bigcirc_{i \in W_\pi} \pi(i)$ denotes a concatenation of the standard binary representations of $\pi(i)$ for all
$i \in W_\pi$ listed in increasing order of the index $i$.

We first argue that $s$ is indeed a uniquely decodable code. To that end, notice that for every
$\pi \in P_n$ and for every $i \in [n]$, we can extract the value of $\pi(i)$ from $s(\pi)$ as follows:

(1) Check in $v_\pi$ if the correct value of $\pi(i)$ can be extracted from $f(\pi)$, that is, if $v_\pi(i) = 1$.

(2) If it can ($v_\pi(i) = 1$), then $\pi(i)$ is extracted by computing $g(f(\pi), i)$ (recall that $f(\pi)$ is found
in the second segment of $s(\pi)$).

(3) Otherwise ($v_\pi(i) = 0$), $\pi(i)$ is extracted from the third segment of $s(\pi)$.

Moreover, the coding scheme $s$ is prefix-free (and hence uniquely decodable) since $v_\pi = v_{\pi'}$ implies
that $|s(\pi)| = |s(\pi')|$ for every two permutations $\pi, \pi' \in P_n$. Thus, if the codewords $s(\pi)$ and $s(\pi')$
agree on the first $n$ bits, then they must have the same length, which means that $s(\pi)$ cannot be a
proper prefix of $s(\pi')$.
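The encoding and the three decoding steps above can be sketched in a few lines. Everything here is illustrative: positions and values are 0-indexed (the text uses $[n]$), each value is written with $\lceil \log_2 n \rceil$ bits, and `make_toy_advice` builds a hypothetical pair $f$, $g$ (store the first $k$ values verbatim, otherwise guess the identity) only so that the example runs. The proof itself makes no assumption about how $f$ and $g$ work.

```python
from math import ceil, log2


def encode(perm, f, g):
    """s(pi) = v_pi . f(pi) . (binary pi(i) for every i with v_pi(i) = 0)."""
    n = len(perm)
    width = max(1, ceil(log2(n)))
    advice = f(perm)
    v = "".join("1" if g(advice, i) == perm[i] else "0" for i in range(n))
    tail = "".join(format(perm[i], f"0{width}b") for i in range(n) if v[i] == "0")
    return v + advice + tail


def decode(code, n, m, g):
    """Recover pi from s(pi), given n, the advice length m, and g."""
    width = max(1, ceil(log2(n)))
    v, advice, tail = code[:n], code[n:n + m], code[n + m:]
    perm, pos = [], 0
    for i in range(n):
        if v[i] == "1":                       # step (2): g answers correctly
            perm.append(g(advice, i))
        else:                                 # step (3): read the third segment
            perm.append(int(tail[pos:pos + width], 2))
            pos += width
    return perm


def make_toy_advice(n, k):
    """Hypothetical f and g: f stores the first k values, g reads them back
    and guesses the identity elsewhere. Returns (f, g, m)."""
    width = max(1, ceil(log2(n)))

    def f(perm):
        return "".join(format(perm[i], f"0{width}b") for i in range(k))

    def g(advice, i):
        if i < k:
            return int(advice[i * width:(i + 1) * width], 2)
        return i

    return f, g, k * width
```

With `perm = [2, 0, 3, 1]` and `k = 2`, the codeword has length $n + m + \log(n) \cdot |W_\pi| = 4 + 4 + 2 \cdot 2 = 12$, and `decode` returns the original permutation.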

It remains to show that $\mathbb{E}_{\pi \in_r P_n}[|s(\pi)|] < \log(n!)$. By definition, $|s(\pi)| = n + m + \log(n) \cdot |W_\pi|$
for every $\pi \in P_n$, so

$$\mathbb{E}_{\pi \in_r P_n}[|s(\pi)|] = n + m + \log(n) \cdot \mathbb{E}_{\pi \in_r P_n}[|W_\pi|] .$$

The assumption that $\mathbb{P}_{\pi \in_r P_n,\, i \in_r [n]}(g(f(\pi), i) = \pi(i)) > 2\alpha$ implies that $\mathbb{P}_{\pi \in_r P_n,\, i \in_r [n]}(i \in W_\pi) <
1 - 2\alpha$, hence $\mathbb{E}_{\pi \in_r P_n}[|W_\pi|] < (1 - 2\alpha) n$. Plugging $m = \alpha n \log(n)$, we conclude that

$$\mathbb{E}_{\pi \in_r P_n}[|s(\pi)|] < n + (1 - \alpha) n \log(n) .$$

By the choice of $n$ (satisfying $n(1 + \log(e)) < \alpha n \log(n)$), we derive the desired inequality since
$\log(n!) > n \log(n) - n \log(e)$. The assertion follows.
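For reference, the chain of inequalities in this last step can be written out in full. This only spells out the arithmetic already stated, with all expectations taken over $\pi$ chosen uniformly at random from $P_n$:

```latex
\begin{align*}
\mathbb{E}\bigl[|s(\pi)|\bigr]
  &= n + m + \log(n)\cdot\mathbb{E}\bigl[|W_\pi|\bigr] \\
  &< n + \alpha n\log(n) + (1-2\alpha)\,n\log(n) \\
  &= n + (1-\alpha)\,n\log(n) \\
  &< n\log(n) - n\log(e)
     && \text{since } n(1+\log(e)) < \alpha n\log(n) \\
  &< \log(n!) .
\end{align*}
```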